PREFACE


This, the eleventh conference devoted to parallel processing, marks the
beginning of our second decade. The history of parallel processing and what 
transpired during the first decade were discussed at the keynote session of the 
1981 conference. It is appropriate at this year’s keynote to learn about a 
development which will have a major impact during the second decade - VHSIC, a 
program of the Department of Defense to develop very dense VLSI chips. This 
effort will also improve the capability of several integrated circuit 
manufacturers. While denser VLSI will let us build more powerful parallel
processors, it raises the question of how best to use this new capability.


As in previous years, the conference received many papers from around the world. 
Of the 124 papers submitted, 46 came from 14 countries outside of the United 
States. There were many papers of very high quality - far too many to be 
accommodated at the conference. Final selection of the 67 contributed papers to
be presented was very difficult. To fit all these papers into the schedule we 
were forced to ask the authors of twenty regular papers to condense their 
material to our short paper format. We regret that we could not accept more 
papers and that we had to have a number of them trimmed down. Attendees at 
previous conferences have indicated a preference for maintaining our tradition 
of no parallel sessions, so the schedule is tight and we can only accommodate a
limited number of papers and give them a limited amount of time. The conference 
benefits greatly from this intense competition. We sincerely thank the authors 
of all submissions for their time and effort. 


We owe a deep debt of gratitude to the 153 referees who took time out from their 
normal duties to evaluate the manuscripts we sent them and give us their 
opinions. The job of selecting papers would have been impossible without their 
help. 


The program committee thanks Goodyear Aerospace Corporation and Kent State 
University for their cooperation and support of our committee work. A number of 
individuals helped us with our work including Lynne Brocco, Hazeljean Cheeseman, 
Bob Cronauer, Pat Hawkins, Carl Mickelson, Martha Moffat, Jan Pavkov, Carole 
Rey, and Elizabeth Young. We also extend thanks to the mail service at Goodyear 
Aerospace for ably handling the extra load we gave them (we received and 
dispatched over 1000 pieces of mail in connection with our work). 


Kenneth E. Batcher - Goodyear Aerospace
Willard C. Meilander - Goodyear Aerospace 
Jerry L. Potter - Kent State University 
1982 ICPP Program Committee

THE VHSIC PROGRAM AND ITS IMPACT ON PARALLEL PROCESSING
Dr. D. W. Burlage 
Acting Deputy Director 


VHSIC Program, OUSDRE 
Department of Defense 


ABSTRACT 


The VHSIC Program is now approaching the midpoint of its Phase I effort 
that involves establishment of pilot line capabilities for 28 complex 
silicon chips employing 1.25 micrometer or smaller feature size 
processing. These chips, which will be applied in system brassboards 
for electro-optical, communication, acoustical, missile guidance, 
electronic warfare, and radar signal processing functions, are amenable
to highly parallel system architectures. With one of the chip sets, for 
example, two chip types are employed to provide a system with 32 
parallel processors, each with more than 20,000 gates, to obtain a 
system capable of several billion operations per second. In effecting 
these architectures, it is becoming apparent that new combinations of 
skills and technologies are required to realize the potential of this 
new generation of VLSI, and that the greatest challenges are now in
system design, not in device technology.


DESIGN AND PERFORMANCE OF A GENERAL CLASS OF INTERCONNECTION NETWORKS 


Laxmi N. Bhuyan 
Department of Electrical Engineering 
University of Manitoba, Winnipeg 
Manitoba, Canada R3T 2N2 


ABSTRACT 


This paper introduces a general class of 
self routing Interconnection Networks for tightly 
coupled multiprocessor systems. The proposed 
network called "Radix Shuffle Network (RSN)" is 
based on a new interconnection pattern called 
Radix Shuffle and is capable of connecting any 
number of processors M to any number of memory 
modules N. The technique results in a variety of 
Interconnection Networks depending on how M and N 
are factorized. The network covers a broad spec- 
trum of interconnections starting from shared bus 
to crossbar switches and various Multistage In- 
terconnection Networks (MINs). The permutation 
capabilities of such networks are outlined. The 
performance of the networks with respect to their 
Bandwidth and cost is analyzed and compared with 
that of a crossbar. Design procedures for obtaining
an optimal network with the highest cost
efficiency are also presented.


I. INTRODUCTION 


The performance of a tightly coupled multiprocessor
system rests primarily on an efficient design of the
network interconnecting the processors to the memory
modules. A crossbar switch [1] allows all possible
one-to-one connections between the processors and
memories, but its cost grows rapidly with the
network size. As an alternative to the crossbar,
Multistage Interconnection Networks (MINs), both
nonblocking and blocking, have assumed paramount
importance in recent times [2-11]. A MIN is basically
a blocking network which does not allow all possible
permutations but is far less expensive than a
crossbar switch. A conflict arises when two or
more processors need the same link between two
successive stages in reaching their destinations.
Due to this interference, a subset of processors
might be blocked, degrading performance.
Bandwidth (BW) of a network is defined as the
expected number of memory modules remaining busy
in a cycle, or the number of memory requests
accepted per cycle. Clearly, this parameter
specifies how efficient a network is. In a
crossbar, all the memory requests are accepted as
long as no two or more processors address the same
memory module. In a random mode of request, the
memory BW of even a crossbar is much less than the
actual capacity [12]. In a MIN, this value ought
to be still less because of additional conflicts
in the network. The interference analysis of such
networks has been reported in a few papers
recently [7,13,14].


The usual design of a MIN employs 2x2
switching elements. However, with the advancement
in LSI technology, it might be better to
employ a larger module if the network performance
could be improved. It is also known that for a
crossbar the BW increases with the
number of memory modules [12]. So a study of the
design of an MxN MIN with N>M seems appropriate.
Patel's Delta network [7] is a logical approach
in this direction. The Delta network is a self
routing (digit controlled) network connecting $a^n$
inputs to $b^n$ outputs through axb crossbar switches
at each stage. Networks like Omega [4],
Indirect binary n-cube [6], Baseline [8] etc.
form a subset of Delta networks with a=b=2.


This paper presents a still broader class of 
networks called Radix Shuffle Network (RSN). It 
connects M inputs to N outputs, for any arbitrary 
values of M and N. As a result, the existing 
general network like Delta becomes a special case 
of the proposed RSN. An Mx1 RSN represents a
shared bus multiprocessor system and a single
stage MxN RSN represents a crossbar switch.
Although several networks can be obtained by 
constructing de-multiplexer trees [7] from inputs 
to the outputs, a new interconnection pattern 
called "Radix shuffle" will be considered 
throughout this paper. 


II. A MIXED RADIX NUMBER SYSTEM 


Let M be a decimal number and be represented
as a product of r factors as

$M = m_1 \times m_2 \times \cdots \times m_r$

Then each number x from 0 to M-1 can be
expressed as an r-tuple $(x_1 x_2 \ldots x_i \ldots x_r)$ with
$0 \le x_i \le m_i - 1$. $x_r$ is the least significant digit
and $x_1$ is the most significant digit. Associated
with each $x_i$ is a weight $w_i$ such that

$\sum_{i=1}^{r} x_i w_i = x$ and $w_i = \frac{M}{m_1 m_2 \cdots m_i}$ for all $1 \le i \le r$.

$w_r = \frac{M}{m_1 m_2 \cdots m_r} = 1$ always.

The lowest number is $0_{10} = (0\,0 \ldots 0)$ and the highest
number is $(M-1)_{10} = (m_1-1, m_2-1, \ldots, m_r-1)$.


Whenever needed, a number x will be represented
as $(x)_{m_1,m_2,\ldots,m_r}$ to specify the radix involved.

Example 1. Let M = 6 = 3 x 2
$m_1 = 3$, $m_2 = 2$; $w_1 = 2$, $w_2 = 1$
$0_{10} = (00)$, $1_{10} = (01)$, $2_{10} = (10)$
$3_{10} = (11)$, $4_{10} = (20)$, $5_{10} = (21)$


This mixed radix number system forms the basis of 
our interconnection. Although the same radix 
system was used for Omega Networks of Lawrie [5], 
there are two basic differences between the pro- 
posed RSN and Omega. First, RSN is a MxN network 
for any arbitrary values of M and N as against a 
NxN Omega Network. Secondly, the interconnection 
pattern between two stages of our RSN is com- 
pletely different and is based on a new term 
"Radix Shuffle" as defined below. 


Definition 1 In the above mixed radix system,
the radix shuffle of a number $x = (x_1 x_2 \ldots x_r)_{m_1,m_2,\ldots,m_r}$
will be defined as $S(x) = (x_2 x_3 \ldots x_r x_1)_{m_2,m_3,\ldots,m_r,m_1}$.


Example 2 For M = 3 x 4, the numbers are represented
from 00 to 23. Any $(x_1 x_2)_{3,4}$ will be
connected to $S(x) = (x_2 x_1)_{4,3}$ as shown in Fig. 1.


The connection procedure is as follows. 


Number the inputs in radix (3,4) and outputs 
in (4,3). Make a perfect shuffle of the input 
and connect to the particular output. 


Definition 2 An m-shuffle of an integer x is
given by

$S_m(x) = m x \bmod (M-1)$ for $0 \le x < M-1$

$S_m(x) = x$ for $x = M-1$


As an example, Fig. 1 again shows a 3-shuffle
of 12 inputs. There is a definite relationship
between "radix shuffle" and m-shuffle as
stated in the following theorem.


Theorem 1 A radix shuffle of $(x)_{m_1,m_2,\ldots,m_r}$ is
identical to an $m_1$-shuffle of x.

The proof of the above theorem is omitted because
of space restrictions.

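
The mixed radix representation, the radix shuffle of Definition 1 and the m-shuffle of Definition 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`to_mixed_radix`, `from_mixed_radix`, `radix_shuffle`, `m_shuffle`) are hypothetical and do not come from the paper. The sketch checks the statement of Theorem 1 exhaustively for a given factorization; it is not a proof.

```python
def to_mixed_radix(x, radices):
    """Digits (x_1, ..., x_r) of x in base (m_1, ..., m_r), x_1 most significant."""
    digits = []
    for m in reversed(radices):
        x, d = divmod(x, m)
        digits.append(d)
    return tuple(reversed(digits))


def from_mixed_radix(digits, radices):
    """Inverse of to_mixed_radix: rebuild the integer from its digits."""
    value = 0
    for d, m in zip(digits, radices):
        value = value * m + d
    return value


def radix_shuffle(x, radices):
    """Definition 1: rotate the digits (and their radices) one place to the left."""
    radices = tuple(radices)
    digits = to_mixed_radix(x, radices)
    return from_mixed_radix(digits[1:] + digits[:1], radices[1:] + radices[:1])


def m_shuffle(x, m, M):
    """Definition 2: x -> m*x mod (M-1), with the last input M-1 mapped to itself."""
    return x if x == M - 1 else (m * x) % (M - 1)
```

For the 12 inputs of Example 2, `radix_shuffle(x, (3, 4))` and `m_shuffle(x, 3, 12)` agree for every x, which is what Theorem 1 states.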
III. A RADIX SHUFFLE NETWORK (RSN) 


Let M and N be represented as products of r
terms as $M = m_1 \times m_2 \times \cdots \times m_r$ and
$N = n_1 \times n_2 \times \cdots \times n_r$. An
MxN RSN with M inputs and N outputs is an r-stage
Interconnection Network, consisting of a few
crossbar switches of size $(m_i \times n_i)$ at the ith
stage for all $1 \le i \le r$. The inputs and outputs are
numbered with base $(m_1,m_2,\ldots,m_r)$ and base
$(n_1,n_2,\ldots,n_r)$ respectively in the mixed radix number
system. The switches at stage i will be set as
per the ith digit of the destination tag. When
either M or N is a prime number the RSN reduces
to an MxN crossbar switch.


Let $M_i$ and $N_i$ indicate the number of inputs
and outputs at the ith stage of the RSN. The first
stage will consist of $M/m_1$ $(m_1 \times n_1)$
crossbar switches producing $N_1 = \frac{M}{m_1} n_1$ outputs,
with $M_1 = M$. The second stage will have
$\frac{N_1}{m_2} = \frac{M n_1}{m_1 m_2}$ $(m_2 \times n_2)$ crossbar switches
producing $N_2 = \frac{M n_1 n_2}{m_1 m_2}$ outputs. In general, the ith stage
will consist of

$\frac{M n_1 n_2 \cdots n_{i-1}}{m_1 m_2 \cdots m_i}$

switches of size $(m_i \times n_i)$ each and will produce

$N_i = \frac{M n_1 n_2 \cdots n_i}{m_1 m_2 \cdots m_i}$ outputs.

The rth or final stage will have

$\frac{M n_1 n_2 \cdots n_{r-1}}{m_1 m_2 \cdots m_r} = \frac{N}{n_r}$

$(m_r \times n_r)$ crossbar switches producing $N_r = N$ outputs.
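
The stage sizes above follow a simple recurrence: each stage divides its incoming lines among switches with $m_i$ inputs, and each switch contributes $n_i$ outgoing lines. The sketch below illustrates this; the function name `stage_layout` and its return format are assumptions made for this example only.

```python
from math import prod


def stage_layout(m, n):
    """Return one (number_of_switches, N_i) pair per stage of an RSN.

    m and n are the factor lists (m_1..m_r) and (n_1..n_r); M is prod(m).
    """
    if len(m) != len(n):
        raise ValueError("M and N must be split into the same number of factors")
    lines = prod(m)  # M_1 = M lines enter the first stage
    layout = []
    for m_i, n_i in zip(m, n):
        switches = lines // m_i   # each switch takes m_i of the incoming lines
        lines = switches * n_i    # and produces n_i outgoing lines
        layout.append((switches, lines))
    return layout
```

For M = 6 = 3 x 2 and N = 8 = 4 x 2 this gives two (3x4) switches followed by four (2x2) switches, and the last entry always reports N outputs.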


Demultiplexer trees can be drawn from a par- 
ticular input to all outputs for full connectiv- 
ity and the overlap of such M trees will give 
rise to a self routing Interconnection network. 
However the interconnection pattern Radix shuffle 
is of interest throughout this paper. The ith
stage of RSN will be preceded by an $m_i$-shuffle
(also a radix shuffle) for all $1 \le i \le r$, as shown in
Fig. 2. 


Let us consider the interconnection pattern
in some more detail. The M inputs are numbered
in base $(m_1,m_2,\ldots,m_r)$. The first stage of switches
will be preceded by a radix shuffle, which is an
$m_1$-shuffle in this case. The inputs to the first
stage are numbered in base $(m_2,m_3,\ldots,m_r,m_1)$
following the radix shuffle. The outputs of the
first stage of switches of size $(m_1 \times n_1)$ will be
numbered in base $(m_2,m_3,\ldots,m_r,n_1)$. The inputs
to the second stage of switches will be numbered
in base $(m_3,m_4,\ldots,m_r,n_1,m_2)$ following a radix
shuffle interconnection. The outputs of the
second stage of switches will be numbered in base
$(m_3,m_4,\ldots,m_r,n_1,n_2)$ and so on. We have the
following theorem.


Theorem 2 The M-input N-output Radix Shuffle 
Network constructed as above is indeed self rout- 
ing. 


Proof Let the source $S = (s_1 s_2 \ldots s_r)_{m_1,m_2,\ldots,m_r}$
be desired to be connected to the destination
$D = (d_1 d_2 \ldots d_r)_{n_1,n_2,\ldots,n_r}$. After the first stage of
radix shuffle, the source converts to
$(s_2 s_3 \ldots s_r s_1)_{m_2,m_3,\ldots,m_r,m_1}$ at the input of the 1st
stage of switches. The particular $(m_1 \times n_1)$ switch
at the 1st stage will connect the source to the
output $(s_2 s_3 \ldots s_r d_1)_{m_2,m_3,\ldots,m_r,n_1}$ depending on
the destination digit $d_1$. After the second radix
shuffle it becomes $(s_3 s_4 \ldots s_r d_1 s_2)_{m_3,m_4,\ldots,m_r,n_1,m_2}$
at the input of the 2nd stage of switches and
$(s_3 s_4 \ldots s_r d_1 d_2)_{m_3,m_4,\ldots,m_r,n_1,n_2}$ at the output.
At the output of the ith stage of switches it becomes

$(s_{i+1} s_{i+2} \ldots s_r d_1 d_2 \ldots d_i)_{m_{i+1},m_{i+2},\ldots,m_r,n_1,n_2,\ldots,n_i}$.

After the rth stage of switches the source S
is connected to $D = (d_1 d_2 \ldots d_r)_{n_1,n_2,\ldots,n_r}$.
Since the mixed radix system is unique, there
exists a unique path between any input and any
output.

Q.E.D.


Example 3 Let M = 6 = 3 x 2 and N = 8 = 4 x 2, so that
$m_1 = 3$, $m_2 = 2$ and $n_1 = 4$, $n_2 = 2$.

The RSN consists of two (3x4) crossbar
switches in the 1st stage and four (2x2) switches
at the second stage as shown in Fig. 3. The inputs
are numbered in base (3,2) and the outputs
in base (4,2). The 1st stage of interconnection
is a radix shuffle of base (3,2) giving rise to a
3-shuffle. The inputs to the 1st stage of
switches are numbered in base (2,3). The outputs
of the 1st stage are numbered in base (2,4). The
inputs to the 2nd stage of switches are numbered
in base (4,2) following the 2-shuffle interconnection.
Finally, the outputs are in base (4,2).
The connection between input $3 = (11)_{3,2}$ and
output $1 = (01)_{4,2}$ is shown with a dark line in Fig. 3.
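
The digit-by-digit routing described in the proof of Theorem 2 can be traced with a short Python sketch. The names `to_digits` and `route` are hypothetical, and the trace format (digits and bases at the output of each switching stage) is an assumption chosen for illustration.

```python
def to_digits(x, radices):
    """Mixed radix digits of x, most significant digit first."""
    digits = []
    for m in reversed(radices):
        x, d = divmod(x, m)
        digits.append(d)
    return tuple(reversed(digits))


def route(source, dest, m, n):
    """Trace a source through the RSN stages.

    Returns, for each stage, the (digits, bases) label of the line the
    request occupies at the output of that stage's switch.
    """
    d = to_digits(dest, n)
    digits, bases = list(to_digits(source, m)), list(m)
    trace = []
    for i in range(len(m)):
        # Radix shuffle before stage i+1: rotate digits and bases left by one.
        digits = digits[1:] + digits[:1]
        bases = bases[1:] + bases[:1]
        # The switch replaces the last digit (s_{i+1}, base m_{i+1})
        # by the destination digit d_{i+1} in base n_{i+1}.
        digits[-1] = d[i]
        bases[-1] = n[i]
        trace.append((tuple(digits), tuple(bases)))
    return trace
```

For Example 3, `route(3, 1, (3, 2), (4, 2))` passes through line (10) in base (2,4) after the first stage and ends at (01) in base (4,2), i.e. output 1.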


The RSN is self routing in the sense that
the connection in an $(m_i \times n_i)$ crossbar module at
the ith stage is controlled by the ith digit $d_i$
of the desired output, $0 \le d_i \le n_i - 1$. When all $m_i$'s
are equal to a and all $n_i$'s are equal to b, the
RSN reduces to an $a^r \times b^r$ Delta network [7]. The
mixed radix system becomes a simple higher radix
system. The interconnection before stage i becomes
an $m_i = a$-shuffle for all $1 \le i \le r$. The first
stage of the interconnection in RSN allows the
identity permutation. With r=1, any MxN RSN is
equivalent to a crossbar switch.


When N=1, M processors share a
common memory through an Mx1 switch. This is
equivalent to a shared bus system. Although different
interconnection networks can be obtained
by constructing demultiplexer trees from input to
output, they are all equivalent in terms of total
number of permutations, bandwidth, probability of
acceptance etc. The radix shuffle is just a convenient
and useful way of interconnection. The
Multistage Interconnection Networks (MINs) such as
Omega [4], Indirect Binary n-cube [6], Baseline
[8] etc. employing 2x2 switches are essentially a
part of our RSN with $M = N = 2^r$.


IV. PERMUTATION CAPABILITIES OF RSN 


Let the capacity C be defined as the maximum
number of simultaneous input-output connections
that can be achieved through a network. For an
MxN crossbar C = Min{M,N} [15]. In an NxN multistage
interconnection network, although some permutations
are not possible, the capacity still remains
equal to N. In a RSN, the capacity is upper
bounded by the minimum number of inputs/outputs
at any stage. For example, in a 6x8 RSN
with M=6=3x2 and N=8=2x4, the number of outputs
from the first stage is $N_1 = 4$, which is even less than the
number of processors. As a result no more than 4
processors can be simultaneously connected to the
output.


Lemma 1: In a RSN the capacity C is upper bounded
by $\min\{M, N_i \mid 1 \le i \le r\}$.


In a single mxn crossbar, the capacity being
min{m,n}, it is worthwhile looking into how many
possibilities of connections exist. For example,
if $m \le n$, all inputs can be simultaneously connected
to m out of n outputs provided no two or more
inputs address the same memory module. In an nxn
crossbar n! such different mappings or permutations
are possible. In an mxn crossbar with $m \le n$,
there are $\binom{n}{m}$ combinations of choosing m
out of n memory modules. Associated with
each combination, there are m! permutations. The
following lemma results:


Lemma 2 The number of permutations achievable
by an $(m_i \times n_i)$ crossbar module at the ith stage of
RSN is given by

$s_i = \binom{n_i}{m_i}\, m_i!$ for $m_i \le n_i$

$s_i = \binom{m_i}{n_i}\, n_i!$ for $m_i \ge n_i$


Theorem 3 If the RSN is obtained such that, for all $1 \le i \le r$,
$M \ge M_i \ge N_i$ for $M \ge N$ and $M \le M_i \le N_i$ for $M \le N$, the
total number of permutations achievable is

$\prod_{i=1}^{r} s_i^{k_i}$

where $k_i$ is the number of switches at the ith
stage and $s_i$ is as given by Lemma 2.


The proofs of the above lemmas and theorem 
are omitted because of space limitation. 
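
The count in Lemma 2 is easy to check numerically for small modules. The sketch below, with the hypothetical names `crossbar_mappings` and `brute_force_mappings`, compares the closed form against a direct enumeration of conflict-free assignments; it covers a single crossbar module only, not the product in Theorem 3.

```python
from itertools import permutations
from math import comb, factorial


def crossbar_mappings(m, n):
    """Closed form of Lemma 2: C(max, min) * min! full conflict-free mappings."""
    small, large = min(m, n), max(m, n)
    return comb(large, small) * factorial(small)


def brute_force_mappings(m, n):
    """Enumerate every way to connect all ports on the smaller side
    to distinct ports on the larger side."""
    small, large = min(m, n), max(m, n)
    return sum(1 for _ in permutations(range(large), small))
```

For the (3x4) module of Example 3 both functions give 24, and for a (2x2) module both give 2.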


A conflict is said to occur in a network
when two or more sources require the same link
between two stages for reaching their destinations.
For example, in Fig. 3, the connections
0 -> 4 and 2 -> 5 require the same link and cannot
be simultaneously achieved. In case of conflicts,
the connections are usually achieved in
two or more cycles. The following theorem
characterizes the conflict situation in a RSN.


Theorem 4 In a RSN, a conflict occurs if
at least two sources $S_x$, $S_y$ try to reach destinations
$D_x$, $D_y$ such that, for some $1 \le i \le r$,

$(d_1 d_2 \ldots d_i)_x = (d_1 d_2 \ldots d_i)_y$ and
$(s_{i+1} s_{i+2} \ldots s_r)_x = (s_{i+1} s_{i+2} \ldots s_r)_y$.

Proof $S_x$ and $S_y$ are in base $(m_1,m_2,\ldots,m_r)$.
$D_x$ and $D_y$ are in base $(n_1,n_2,\ldots,n_r)$.

From Theorem 2, a conflict will occur at the
output of the ith stage if

$(s_{i+1} s_{i+2} \ldots s_r d_1 d_2 \ldots d_i)_x = (s_{i+1} s_{i+2} \ldots s_r d_1 d_2 \ldots d_i)_y$.

Since both are in base $(m_{i+1},m_{i+2},\ldots,m_r,n_1,n_2,\ldots,n_i)$,
which is a mixed radix system with unique
representation of a number, the theorem holds.
Q.E.D.
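
The condition of Theorem 4 translates directly into a digit comparison. The following sketch assumes two distinct sources and uses the hypothetical names `to_digits` and `first_conflict_stage`; it reports the first stage whose output link both requests would need, or `None` when the two paths are link-disjoint.

```python
def to_digits(x, radices):
    """Mixed radix digits of x, most significant digit first."""
    digits = []
    for m in reversed(radices):
        x, d = divmod(x, m)
        digits.append(d)
    return tuple(reversed(digits))


def first_conflict_stage(src_x, dst_x, src_y, dst_y, m, n):
    """Smallest stage i (1-based) satisfying the condition of Theorem 4, else None."""
    sx, sy = to_digits(src_x, m), to_digits(src_y, m)
    dx, dy = to_digits(dst_x, n), to_digits(dst_y, n)
    for i in range(1, len(m) + 1):
        # Same leading destination digits d_1..d_i and same trailing
        # source digits s_{i+1}..s_r means the same link after stage i.
        if dx[:i] == dy[:i] and sx[i:] == sy[i:]:
            return i
    return None
```

For the network of Fig. 3 (M = 3 x 2, N = 4 x 2), the connections 0 -> 4 and 2 -> 5 share $d_1 = 2$ and $s_2 = 0$, so the sketch reports a conflict at stage 1, in agreement with the example in the text.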


V. COST MODEL OF RSN 


Before modeling the RSN, it is imperative to
look into the complexity of a self routing mxn
crossbar module, which forms the basic block in
the RSN. At the ith stage, the $m_i \times n_i$ crossbar should
be able to connect any one of its $m_i$ inputs to
any one of its $n_i$ outputs as determined by the
ith digit $d_i$ of the destination tag. This would
necessitate $\lceil \log_2 n_i \rceil$ control lines from each
processor. In addition, there may be one request
and one acknowledge line from each crossbar
module. The control unit inside the switch will
decode this address and connect the data lines of
the particular processor to one of the output
data lines. This will be achieved through a
bidirectional data switch available in the crossbar
module. The number of data lines will depend on
the pattern of data transfer. In a serial data
transfer, there will be only one line per processor.
It may also be practical to transmit data
in a bit slice mode with some w bits per processor.
The block diagram of an mxn switching
element is shown in Fig. 4a. The complexity of
the control unit as well as the data switch grows
on the order of mn. Assuming unit cost for a
one input/one output switch, the cost of an mxn
switch = mn. The model results in a crossbar
with mn cross points as shown in Fig. 4b. For a
2x2 switching element, the cost is 4 units. For
a MIN employing $\log_2 N$ stages the total cost =
$4 \times \frac{N}{2} \times \log_2 N = 2N\log_2 N = O(N\log_2 N)$. For an NxN
crossbar, cost = $N^2$. This modeling is in agreement
with the model developed in [5] in terms of
the logic gates.


A RSN employs $\frac{M n_1 n_2 \cdots n_{i-1}}{m_1 m_2 \cdots m_i}$ $(m_i \times n_i)$
switching modules at the ith stage. The cost associated
with the ith stage is $\frac{M n_1 n_2 \cdots n_i}{m_1 m_2 \cdots m_{i-1}}$. Hence,
the total cost of the RSN is

$\sum_{i=1}^{r} \frac{M n_1 n_2 \cdots n_i}{m_1 m_2 \cdots m_{i-1}}$


A special case of interest is an NxN RSN
where $m_i = n_i$ for all $1 \le i \le r$. The total cost is

$N \sum_{i=1}^{r} n_i$.

We get the following results. 


Lemma 3 The cost of an NxN RSN for some fixed
r stages of switching elements is minimum when
realized as $N = n^r$.


Theorem 5 The overall cost of a NxN RSN is ab- 
solute minimum when realized with all the factors 
of N as prime numbers. 


The proofs of the above lemma and theorem 
are obvious and hence, have been omitted. 
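
The cost model of this section (mn units per mxn switch, summed over all switches of all stages) can be evaluated with a few lines of Python. The function name `rsn_cost` is an assumption made for this sketch.

```python
from math import prod


def rsn_cost(m, n):
    """Total crosspoint cost of an RSN with factor lists m and n.

    Each (m_i x n_i) switch costs m_i * n_i units, following Section V.
    """
    lines = prod(m)  # lines entering the first stage
    total = 0
    for m_i, n_i in zip(m, n):
        switches = lines // m_i
        total += switches * m_i * n_i
        lines = switches * n_i
    return total
```

For the NxN case with $m_i = n_i$ the result equals $N \sum n_i$: a 16x16 network costs 128 units when built from 2x2 switches or from 4x4 switches, 160 units as 8x2, and 256 units as a single crossbar. The 6x8 network of Example 3 costs 40 units.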


VI. ANALYSIS OF RSN 


In this section, we will analyze the RSN 
with respect to its Bandwidth and Probability of 
acceptance of a request and compare it with those 
of a crossbar. Bandwidth (BW) is defined as the 
expected number of memory requests accepted per 
cycle. Probability of acceptance (PA) is the 
ratio of expected BW to the expected number of 
requests generated per cycle. The RSN and cross- 
bar will be analyzed under the following identi- 
cal assumptions. 


1. The operation is synchronous, i.e. the messages
begin and end simultaneously.


2. Each processor generates a random and independent
request. The requests are uniformly
distributed over all the memory modules.


3. At the beginning of a cycle, each processor 
generates a new request with a probability 
p. Thus p is the average number of requests 
generated per cycle by each processor. 


4. The requests which are not accepted are ig- 
nored. The requests issued at the next 
cycle are independent of the requests of the 
previous cycle. 


Various simulation results indicate that the 
above set of assumptions does not result in a 
significant difference in the performance. More- 
over, it stands well for comparison purposes. 


The BW and PA of an mxn crossbar module are
given by [7,12,13]

$BW = n - n\left(1 - \frac{p}{n}\right)^m$,

$PA = \frac{n}{pm} - \frac{n}{pm}\left(1 - \frac{p}{n}\right)^m$,

where p is the average number of messages generated
per processor per cycle.


The above equations are quite simple and 
they compare well with the results of simulation 
[12]. We compared the results of a 2x2 Delta 
network [7] using the above equations with those 
reported by Nelson [14]. Patel's analysis shows 
a better closeness to the simulation results. We 
will simply use these equations for analysis of 
RSN instead of pursuing the matter further. 


Dividing the bandwidth by n gives us the
rate of requests on any one of the n output lines
of an mxn crossbar module, as a function of its
input rate:

$p_{out} = 1 - \left(1 - \frac{p_{in}}{n}\right)^m$

In a RSN, the output rate of the ith stage is
also the input rate to the (i+1)th stage. Hence, one
can recursively evaluate the output rate of any
stage starting with the input rate of the first
stage. The output rate of stage r determines the
BW of a RSN.

Let $p_i$ be the rate of requests at the output
of the ith stage. Then

$p_i = 1 - \left(1 - \frac{p_{i-1}}{n_i}\right)^{m_i}$,

where $p_0$ is the rate of requests generated by the processors.


A column Bandwidth (CBW) is the BW at the
output of a particular column:

$CBW_i = N_i\, p_i = \frac{M n_1 n_2 \cdots n_i}{m_1 m_2 \cdots m_i}\, p_i$

The output BW is the CBW at stage r: $BW = N \cdot p_r$.
The probability that a request will be accepted
in the RSN is $P_A = \frac{N p_r}{M p_0}$.
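
The recursion for the stage output rates can be evaluated directly. The sketch below assumes the four modelling assumptions listed earlier in this section; the function names `rsn_bandwidth` and `acceptance_probability` and the default request rate `p0 = 1.0` are choices made for this illustration, not values taken from the paper.

```python
from math import prod


def rsn_bandwidth(m, n, p0=1.0):
    """Expected number of requests accepted per cycle, BW = N * p_r."""
    p = p0
    for m_i, n_i in zip(m, n):
        # Output rate of an (m_i x n_i) crossbar given its input rate p.
        p = 1.0 - (1.0 - p / n_i) ** m_i
    return prod(n) * p


def acceptance_probability(m, n, p0=1.0):
    """P_A = BW / (M * p0), the fraction of generated requests accepted."""
    return rsn_bandwidth(m, n, p0) / (prod(m) * p0)
```

With a single stage the recursion reduces to the crossbar formula $n - n(1 - p/n)^m$. For a 4x4 network built from two stages of 2x2 switches and $p_0 = 1$, the stage rates are 0.75 and 0.609375, giving a bandwidth of 2.4375.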


VII. NETWORK OPTIMIZATION 


In this section, we present some interesting 
results on how to design a cost effective inter- 
connection network. The BW reflects the perform- 
ance of a network and a cost model was obtained 
in section V. We will define a cost factor & as 
the ratio of BW to cost. Since we do not have a 
closed form solution for BW, most of the results 
presented in this section are experimental, ob- 
tained through computation. We will study the 
characteristics of both MxN and NxN networks. 


Given a value of N for a NxN RSN, there may 
be several ways to factorize N into r components. 
As an example, for N=16 and r=2, N can be ex- 
pressed as 8x2 or 4x4. The obvious question a- 
rises, for given values of N and r, what is the 
optimum realization of a network. This will be 
referred to as local optimization. The following 
observation is made. 


Conjecture 1 The most cost effective realization
of an NxN RSN in some r stages is obtained when
$m_i = n_i = \sqrt[r]{N}$ for all $1 \le i \le r$.

The conjecture is obtained from computational
results. Let r=2. The cost factor &=BW/cost
is plotted in Fig. 5 for various values of N, a
power of two. The peaks are obtained at a theoretical
value of $\sqrt{N}$, even if this is not an integer.
A computer search was carried out for a
few values of r, which resulted in $\sqrt[r]{N}$ as the optimal
realization. Since $\sqrt[r]{N}$ may not be an integer
for a fixed r, the $m_i$'s should be as close to $\sqrt[r]{N}$
as possible.

There is another tradeoff in building a RSN. 
For example, 16 can be factorized as 2x2x2x2, 
4x2x2, 4x4 etc. The design which gives the high- 
est cost factor is optimal. This will be re- 
ferred to as global optimization of RSN. 


It has been impossible to derive a closed
form solution for the optimal value of r ($r_{opt}$).
For N a power of n, with n=2 and 3, the optimal
values of r are plotted in Fig. 6. For N a
power of two, the following conjecture states the
most important observation.


Conjecture 2 For an NxN RSN with N a power of two,
as many 4x4 switches as possible should be employed
to yield the most cost effective realization.


From Fig. 6, for a 4x4 network, $r_{opt} = 1$.
This means one 4x4 switch should be employed instead
of the conventional four (2x2) switches. For
N=8, $r_{opt} = 2$; thus the realization should be
N=4x2, with one stage of 4x4 switches and
another stage of 2x2 switches. For N=16, two
stages of 4x4 switches are desired. So, for N a
power of four, 4x4 switches should be employed at
all stages, and for N a power of 2 but not a power
of 4, the last stage will consist of 2x2 switches
and all the previous stages should consist of 4x4
switches.
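
The discrete search behind the two conjectures can be reproduced on a small scale by enumerating every ordered factorization of N and keeping the one with the highest ratio of bandwidth to cost. This is a sketch under assumptions: the request rate `p0 = 1.0` is chosen here because the text does not state the rate used for the figures, and the names `ordered_factorizations`, `cost_factor` and `best_realization` are hypothetical.

```python
def ordered_factorizations(N):
    """Yield every tuple of factors >= 2 whose product is N (order matters)."""
    if N == 1:
        yield ()
        return
    for f in range(2, N + 1):
        if N % f == 0:
            for rest in ordered_factorizations(N // f):
                yield (f,) + rest


def cost_factor(factors, p0=1.0):
    """BW / cost for an NxN RSN with m_i = n_i = factors[i]."""
    N = 1
    for f in factors:
        N *= f
    p = p0
    for f in factors:
        p = 1.0 - (1.0 - p / f) ** f   # output rate of an (f x f) stage
    bandwidth = N * p
    cost = N * sum(factors)            # N * sum(n_i), the NxN cost model
    return bandwidth / cost


def best_realization(N, p0=1.0):
    """Factorization of N with the highest cost factor."""
    return max(ordered_factorizations(N), key=lambda fs: cost_factor(fs, p0))
```

Under these assumptions the search selects (4,) for N=4, (4, 2) for N=8 and (4, 4) for N=16, which is consistent with Conjecture 2 and with the realizations described above.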


A study of a loosely coupled (distributed) system
with a sort of hypercube topology also
resulted in a similar observation [16]. With N
a power of 3, the optimal structure is obtained when
3x3 switching elements are utilized. In general
for any N, a discrete optimization may be carried 
out to determine the most effective realization. 
The networks realized in this manner will be 
referred to as Optimal RSN (ORSN) in this paper. 
The BW, PA and cost efficiency obtained in a NxN 
ORSN for N, a power of two are compared in Figs. 
7, 8 and 9 respectively with those of RSN(2) and 
crossbar. RSN(2) is the MIN obtained with 2x2 
switches at each stage which is equivalent to an 
Omega network. 


We will now examine the performance of an MxN
RSN. It is known from Bhandarkar's analysis [12]
that if the number of memory modules is increased
relative to the number of processors, the BW
increases. This is evident because, with
more memory modules available, fewer conflicts
will occur and, probabilistically, more processors
can be kept busy. The
variations of BW and cost efficiency with added
memory modules are plotted in Figs. 10 and
11 respectively. The number of processors is
kept constant at M=16. Whenever a few memory
modules are added, a fresh design of the RSN is
carried out and the cost efficiency (&) is calculated
to yield a new ORSN. The performance of the
ORSN is plotted together with that of an MxN
crossbar switch.


The BW of a crossbar increases exponentially
and theoretically reaches 16 at N=∞. For the ORSN,
the saturation starts earlier and the BW remains
constant at about 10. This is because of the conflicts
inherent in a RSN. It may be pointed out
here that the ORSN is designed such that the cost
efficiency is the highest of all the designs.
It was observed that the BW improves if
other switch sizes are allowed, but this
happens at the expense of cost effectiveness.
The crossbar itself is also a part of RSN anyway.


A similar experiment was carried out to see
the effect of adding processors to a fixed number
of memory modules. Figs. 12 and 13 are obtained
with various values of M with N kept fixed
at 16. The variation of the curves obtained for
the crossbar can be easily predicted theoretically
from the equations (with p = 1)
$BW = N - N\left(1 - \frac{1}{N}\right)^M$ and
cost efficiency & = $\frac{BW}{MN} = \frac{1 - (1 - 1/N)^M}{M}$.
As $M \to \infty$, the BW of the
crossbar approaches N. However, with a recursive
equation for BW in the case of RSN and the discrete
optimization required, it was not possible to
theoretically predict the characteristics of the
ORSN. The results however seem to be quite realistic.

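
The limiting behaviour of the crossbar curves described in the last few paragraphs can be checked numerically from the crossbar equations alone. The sketch below uses the hypothetical names `crossbar_bandwidth` and `crossbar_cost_efficiency` and takes p = 1 by default, matching the form of the equations above.

```python
def crossbar_bandwidth(M, N, p=1.0):
    """BW of an MxN crossbar: N - N * (1 - p/N)**M."""
    return N - N * (1.0 - p / N) ** M


def crossbar_cost_efficiency(M, N, p=1.0):
    """BW divided by the crosspoint cost M*N."""
    return crossbar_bandwidth(M, N, p) / (M * N)
```

With M fixed at 16, `crossbar_bandwidth(16, N)` grows towards 16 as N increases; with N fixed at 16, `crossbar_bandwidth(M, 16)` grows towards 16 as M increases while the cost efficiency falls roughly as 1/M.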
VIII. CONCLUSIONS 
A broad class of networks called Radix 
Shuffle Networks (RSN) was introduced. The 


network is self routing in the sense that the 
output of the switches at ith stage is selected 
by the ith digit of the destination address. The 
network is so general that it includes systems 
ranging from a shared bus connection to a crossbar.
The cost modeling was approximate but truly
represents the complexity involved. The bandwidth
was chosen as a performance measure with
the assumption that the cycle time is almost the same
for all the realizations of RSN including the crossbar.
Thus, depending on the actual parameters,
the cost efficiency curve may shift a little.
However, we are convinced that the comparisons 
reported in the paper will still stand. 


The observations indicate that adding
more memory modules increases the bandwidth
and that the RSN provides an efficient
design for such an interconnection. Many useful
results were presented. An important observation
is that an NxN MIN seems better served by (4x4) switches
than by the (2x2) switches conceived of so far.
When N is a power of two but not a power of four,
all the stages can employ (4x4) switching elements
except the last one, which will employ (2x2)
switches. The Radix Shuffle makes that connection
possible.

AUGMENTED AND PRUNED N LOG N MULTISTAGE NETWORKS:
TOPOLOGY AND PERFORMANCE 


Daniel M. Dias, Mitre Corporation, Houston, Texas 


and 


J. Robert Jump, Rice University, Houston, Texas 


Abstract -- In this paper N log N Multistage 
Interconnection Networks (MINs) are augmented for 
better reliability and pruned to have differing 
numbers of input and output links. Augmented MINs 
have multiple disjoint paths from network input to 
output links. Optimal pruned MINs with buffers
between stages have a non-intuitive topology. 


Summary 


Several (NxN) multistage networks with log N 
stages (referred to as MINs) have been proposed in 
the literature [1-4]. It can be shown [5] that 
these networks can be constructed recursively as 
illustrated in Figure 1.  MINs can be used to 
interconnect modules of a computing system that 
communicate by passing fixed size packets through 
the MIN [6]. Each packet contains data and a
destination address. At each stage, one digit of 
the destination address is used to route the 
packet to the next stage via the link with the 
same label (see Figure 1). This is referred to as
digit controlled routing. 


The MINs in [1-4] have a unique path from
network input to network output links, leading to
poor reliability. One method of augmenting MINs
is to add an extra stage as illustrated in Figure
2. It can be shown [5] that with a judicious
selection of the interconnection to stage S* in
Figure 2, there are exactly b disjoint paths
(except for the common input and output links)
from each network input link to each network
output link and digit controlled routing can still
be used. An example of a $(2^3 \times 2^3)$ augmented MIN
showing the two disjoint paths appears in Figure
3. Additional redundant paths through the network
can be obtained by adding further stages.


MINs can be pruned by eliminating some of the
switches. This paper considers regular pruned
MINs only [5]. A ($b^n \times b^m$) regular pruned MIN
consists of arbitration, distribution and square
stages. Networks with n greater than m are called
arbitration networks while networks with n less
than m are called distribution networks [7]. The
construction of arbitration stages is shown in
Figure 4. Distribution stages with $b^i$ inputs and
$b^{i+1}$ outputs are constructed in an analogous
manner. At square stages, the original MIN is
left untouched. This paper will present
performance results for ($2^n \times 2^m$) arbitration
networks only. There are $\binom{n}{m}$ different ways of
choosing arbitration stages in a ($2^n \times 2^n$) MIN to
produce different ($2^n \times 2^m$) arbitration networks.
Extreme arbitration networks, shown in Figure 5,
have the arbitration stages concentrated either at
the input stages or at the output stages of the
network.
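
The choice of arbitration stages can be enumerated directly. The sketch below assumes that each arbitration stage of a network built from (2 x 2) switches halves the number of links, so that n - m of the n stages are arbitration stages; it writes each layout in the A/S string notation that the paper uses for its table of matched networks. The function name is illustrative.

```python
from itertools import combinations
from math import comb

def arbitration_layouts(n, m):
    """All stage layouts of a (2**n x 2**m) arbitration network, n >= m.

    A layout is a string over {'A', 'S'} read from the input side:
    'A' marks an arbitration stage, 'S' a square stage.  There are
    n stages in total, of which n - m are assumed to be arbitration stages.
    """
    layouts = []
    for a_positions in combinations(range(n), n - m):
        layouts.append("".join("A" if s in a_positions else "S" for s in range(n)))
    return layouts
```

For n = 4 and m = 2 this yields `comb(4, 2)` = 6 layouts; the first and last, `AASS` and `SSAA`, are the two extreme arbitration networks.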


Unbuffered networks are modelled as operating 
in time slots. In each time slot an attempt is 
made to pass packets at input links to the desired 
output links. If two packets must pass through 
the same switch, a conflict is said to occur and 
one of the packets is selected and passed while the
other is rejected. The performance results
reported here are for the case when rejected 
packets are discarded. Approximate estimates of 
performance for the case when rejected packets are 
retried in the next time slot can be found in [5]. 


Buffered networks have first-in-first-out 
buffers of fixed maximum length between stages. 
The operation of a (2 x 2) switch is modelled 
essentially as follows [8]. The (minimum) packet 
delay at a switch is modelled in terms of two 
timing parameters: time t_select to select the 
output link to which the packet is to be passed
and time t_pass to move the packet through the
switch. The t_select phase for two packets at
different links can occur simultaneously.
However, if two packets (after the t_select phase)
are directed to the same switch output link, one
is randomly picked for transfer and the other is
blocked. Further, the selected packet must wait
until a buffer at the input of the successor
switch becomes available; after a successor switch
buffer becomes available, the packet takes time
t_pass to pass through the switch into this
buffer. 


The throughput is defined informally as the 
average rate at which packets are put out by the 
network. The normalized throughput (NTP) is the 
ratio of the throughput obtained to the maximum 
possible throughput assuming that no conflicts 
occur in the network. For unbuffered networks, $P_0$
is the probability that a packet exists at a
network input link in a time slot and $P_a$ is the
probability that a packet at a network input
buffer is accepted in a time slot. The results
reported here were obtained using both event 
driven simulation and an approximate model with 
coupled Markov chains [5]. 
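
For orientation, the normalized throughput of an unpruned, unbuffered MIN can be estimated with the standard independence approximation commonly attributed to Patel's analysis of delta networks. This is a swapped-in textbook estimate, not the coupled Markov chain model of [5], and it does not cover augmented or pruned networks. The function name and parameters are illustrative.

```python
def ntp_unbuffered(p0, n, b=2):
    """Approximate NTP of an unbuffered (b**n x b**n) MIN.

    p0 is the probability that a packet is present at an input link
    in a time slot.  Assumes uniformly random, independent destinations
    and that rejected packets are discarded.
    """
    p = p0
    for _ in range(n):
        # An output link of a b x b switch is idle only if none of the
        # b inputs carries a packet addressed to it.
        p = 1.0 - (1.0 - p / b) ** b
    return p / p0   # throughput relative to the conflict-free case
```

With every input busy, a single stage of (2 x 2) switches passes three quarters of the offered packets, and the ratio falls as stages are added.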


The NTPs for augmented MINs are shown in
Figures 6 and 7. It is seen that, for fault-free 
operation, there is a small decrease in the NTP 
for the augmented MIN as compared to the 
corresponding MIN. The worst case single internal 
switch or link failure leads to about one half the 
fault-free performance. However, for most 
failures, the performance is only slightly below 
the fault-free performance. 


For unbuffered arbitration networks, the 
extreme networks are seen to have extremes in 
performance as shown in Figure 8. The same is 
true for buffered arbitration networks if all 
switches have the same speed (i.e., the maximum 
packet output rate at a link is the same at all 
switches). The network stages can be matched so 
that all stages have the same maximum throughput. 
This may be done by a serial to parallel 
transformation at an arbitration stage [5]. 
Matched arbitration networks with maximum 
throughput have a non-intuitive topology. 
Examples of matched arbitration networks with 
maximum throughputs for a single buffer between 
stages are shown in Table 1. Here, the
arrangement of stages that gives the highest
normalized throughput is given as a string of A's
and S's, where A denotes an arbitration stage and
S a square stage.


PERFORMANCE OF SELF-ROUTING SHUFFLE-EXCHANGE INTERCONNECTION NETWORK IN SIMD PROCESSORS


Jamshed H. Mirza 
Polytechnic Institute of New York 


333 Jay Street, 


Abstract : This is a study of the class of multi-stage
Shuffle-Exchange (S/E) networks used in
Single-Instruction-Stream-Multiple-Data-Stream (SIMD)
processors, and their ability to realize
interconnection requests which are arbitrary permutations
on the set of Processing Elements (PEs). The network
is made self-routing, and a recursive model is
proposed which analyses a network in terms of
smaller ones. Performance parameters such as bandwidth,
blocking probability, and the number of
passes necessary to realize an arbitrary permutation
are obtained.


Introduction : There are several topologically 
equivalent multi-stage S/E networks that have been 
proposed [2-4]. We will use the Omega network [5] 
as a representative of the class of S/E networks. 
However the results we obtain are equally applica- 
ble to all S/E networks since they have been shown 
to be topologically equivalent [6,7]. 

In SIMD processors, the interconnection requests
by the $N=2^n$ PEs are made synchronously, and
the source-to-destination pairs usually represent
a permutation on the set of PEs. Consequently, we
will analyse the S/E networks with respect to their
ability to realize arbitrary permutations.

It has been shown that any permutation can be
realized in a maximum of three passes through the
network [7]. However, this requires a setup time
that is time-consuming even on SIMD processors. We
will assume instead that the network is self-routing,
so that the switch settings are determined
dynamically and no setup time is necessary. Each
data element carries with it the destination
address as a tag, and the setting of a switch in stage
j is determined by the jth bit of the tags at its
two inputs [5]. Of course, simultaneous connection
of more than one source-destination pair may
result in conflicts.


The Model : Let $N=2^n$. If $i \in [0, 2^n-1]$, then
$i = ((i)_{n-1} \ldots (i)_1 (i)_0)$ in binary. Also, let $\Omega_n$ represent a
$2^n \times 2^n$ Omega network.

The $\Omega_n$ network in fig. 1 is drawn to highlight
the recursive structure of such networks. This
makes possible a recursive procedure for analysing
a large S/E network in terms of smaller ones.

Assume that the interconnection request is a
permutation specified by $D = (D_0, D_1, \ldots, D_{N-1})$,
where D is any arbitrary permutation of the set
(0,1, ... ,N-1). It specifies that input i is to
be connected to output $D_i$. Since D is a permutation,
there are exactly N/2 tags with $(D_i)_{n-1} = 0$,
and the other N/2 have $(D_i)_{n-1} = 1$.

In our self-routing network, no request is
blocked or turned back mid-way through the network.
If a conflict occurs at a switch, one of the requests
is allowed to go to the right output, while
the other is misdirected and forced to go to the
wrong output. At the output of the network the
requests that have correctly reached their destinations
are filtered out, and only the misdirected
ones are made to pass through the network again,
starting from the destination they reached during
the previous pass. The data elements carry with
them two 1-bit flags, m and r, besides the n-bit
tag identifying their destination. At any time, a
request may be a correctly directed request (cdr,
m=0) if it has been directed to the correct output
at all switches it has encountered during the
current pass, or a misdirected request (mdr, m=1) if
it was misdirected at least once during the current
pass. Also, since multiple passes through the
network may be necessary, the r bit tells us whether
a request has (r=1) or has not (r=0) already
reached its destination during some previous pass.
Initially all requests start off with m=0 and
r=0. The self-routing algorithm works as follows :

(i) if a mdr and a cdr meet at a switch, the switch
    is set by the cdr.

(ii) if 2 cdrs or 2 mdrs meet, the switch is set
    according to the tag at the upper input of
    the switch. If a conflict occurs, the lower
    input request is misdirected and is so marked
    by setting its m bit equal to 1.

At the output of the network the cdrs (m=0, r=0)
are filtered out by the respective PEs. Also, if a
request has m=1 and r=0, its m bit is reset to 0
for the next pass. For all other requests, m and r
are both set to 1 to create "dummy" mdrs. The requests
are then cycled again through the network.
Thus during each pass, there are exactly N requests,
some cdrs and some mdrs. Since the self-routing
algorithm always resolves conflicts in favour of a
cdr, the mdrs can be considered to be "don't care"
requests. We can assume that their tags have whatever
value is necessary to justify our assumption
that every network in the recursive model is
presented at its input with a permutation on the
set $(0,1,\ldots,2^k-1)$.
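
Rules (i) and (ii) can be exercised with a small simulation of one pass. This is a sketch under assumptions: the Omega network is modelled as a perfect shuffle ahead of each switch column, stage n-1 is encountered first, tag bit 0 selects the upper output, and the r flag and the filtering at the output are left out. The name `omega_pass` is illustrative.

```python
def omega_pass(tags, m_bits, n):
    """One pass of N = 2**n requests through a self-routing Omega network.

    tags[i]   : destination tag of the request at input i
    m_bits[i] : 1 if that request enters as an mdr ("don't care"), else 0
    Returns a list of (tag, m) pairs indexed by output position.
    """
    N = 1 << n
    reqs = list(zip(tags, m_bits))
    for stage in range(n - 1, -1, -1):          # stage n-1 is encountered first
        shuffled = [None] * N
        for pos, req in enumerate(reqs):        # perfect shuffle = rotate left
            shuffled[((pos << 1) | (pos >> (n - 1))) & (N - 1)] = req
        out = [None] * N
        for sw in range(N // 2):
            upper, lower = shuffled[2 * sw], shuffled[2 * sw + 1]
            # Rule (i): a lone cdr sets the switch; rule (ii): otherwise the upper input does.
            if upper[1] == 1 and lower[1] == 0:
                winner, loser = lower, upper
            else:
                winner, loser = upper, lower
            port = (winner[0] >> stage) & 1     # 0 = upper output, 1 = lower output
            if ((loser[0] >> stage) & 1) == port:
                loser = (loser[0], 1)           # wanted the same port: now misdirected
            out[2 * sw + port] = winner
            out[2 * sw + 1 - port] = loser
        reqs = out
    return reqs
```

For N = 4 the identity permutation passes without conflict, while the request vector (0, 2, 1, 3) produces two conflicts at the first stage, so only two requests arrive as cdrs, each at the output named by its tag.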


Analysis : At stage(n-1) of the $\Omega_n$ network, $D_i$
and $D_{i+N/2}$ meet at a switch. As long as their
(n-1)th bits are different, conflict does not
occur. If their (n-1)th bits are equal, a conflict
occurs and $D_{i+N/2}$ is misdirected. Also, since exactly
N/2 of the tags have the (n-1)th bit
equal to 0 and the remaining have it equal to 1,
conflicts will always occur in pairs. Then,

$$P(2c \text{ conflicts}) = \frac{\binom{N/2}{2c} \binom{2c}{c}\, 2^{N/2-2c}}{\binom{N}{N/2}} \qquad (1)$$

where $0 \le 2c \le N/2$ or $0 \le c \le N/4$.
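
The pairing of conflicts can be checked by brute force for small N. The closed form coded below is an assumed reading of equation (1): choose the 2c conflicting switches, choose which c of them receive two tags with bit n-1 equal to 0, let each remaining switch receive one tag of each kind in either order, and divide by the number of ways to place the N/2 zero bits. Function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations
from math import comb

def p_conflicts_formula(n, c):
    """P(2c conflicts at stage n-1) for a uniformly random permutation."""
    N = 1 << n
    half = N // 2
    return Fraction(comb(half, 2 * c) * comb(2 * c, c) * 2 ** (half - 2 * c),
                    comb(N, half))

def p_conflicts_enumerated(n):
    """Exact distribution of the conflict count, by listing all N! permutations."""
    N = 1 << n
    half = N // 2
    counts = {}
    total = 0
    for D in permutations(range(N)):
        # D[i] and D[i + N/2] share a switch; they clash when bit n-1 agrees.
        k = sum((D[i] >> (n - 1)) == (D[i + half] >> (n - 1)) for i in range(half))
        counts[k] = counts.get(k, 0) + 1
        total += 1
    return {k: Fraction(v, total) for k, v in counts.items()}
```

For N = 4 both routes give probability 2/3 for no conflict and 1/3 for two conflicts, and an odd number of conflicts never occurs.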
Define for stage(n-1) of $\Omega_n$ a $(2^n+1) \times (2^n+1)$
Stage Transmission Matrix $S_n$, such that
$S_n(a,b)$ = P(b cdrs leave stage(n-1) | a cdrs
enter stage(n-1)).
Using (1) we can find $S_n(N,N-k)$ for k even and
$0 \le k \le N/2$. For k odd or $N/2 < k \le N$, $S_n(N,N-k) = 0$.
Using $S_n(N,N-k)$, we can find $S_n(a,b)$ for all
$0 \le a < N$ and $0 \le b \le a$. For b>a, $S_n(a,b)=0$. We
can write $S_n(a,b)$ as $S_n(N-i,N-(i+j))$, which gives
the probability that we enter stage(n-1) with i
mdrs and j new mdrs are created at that stage. We
have to calculate the contributions of the (N,N-k)
cases to the (N-i,N-(i+j)) cases. This represents a
situation where there are i mdrs at the input of
stage(n-1), there are conflicts at k switches and
no conflicts at k"=N/2-k switches, and as a result
of which j new mdrs are created ($j \le k$) at the
output of stage(n-1). Thus k'=k-j of the conflicts do
not create a new mdr because at least one of the i
input mdrs appears at each of those k' conflict
switches. A little reflection will verify that at
least L1=MAX(0,i-(2k"+k)) of these k' switches will
have 2 input mdrs, and at most L2=MIN(i-k',k') of
them can have two input mdrs. Then the contribution
of the (N,N-k) case to the (N-i,N-(i+j)) case is :

$$[(N,N-k) \to (N-i,N-(i+j))] = \frac{\binom{k}{k'} \sum_{p=L1}^{L2} \binom{k'}{p}\, 2^{k'-p} \binom{2k''}{i-k'-p}}{\binom{N}{i}} \qquad (2)$$

We can then find the elements of $S_n$ by :

$$S_n(N-i,N-(i+j)) = \sum_{\substack{k \text{ even} \\ 0 \le k \le N/2}} [(N,N-k) \to (N-i,N-(i+j))]\; S_n(N,N-k) \qquad (3)$$
Using these equations we can find $S_k$ for any $k \ge 1$.
For any $\Omega_k$, define the $(2^k+1) \times (2^k+1)$ Network
Transmission Matrix $T_k$, such that :
$T_k(i,j)$ = P(j cdrs leave $\Omega_k$ | i cdrs enter $\Omega_k$)
$T_k$ is lower-triangular and gives the transmission
through all stages of the network. Note that
$T_1 = S_1 = I$ (where I is the unit matrix).
To determine $T_k$ we need to define for a network
$\Omega_k$ a $(2^k+1) \times (2^k+1)$ matrix $R_k$ such that :
$R_k(i,j)$ = P(j cdrs get through $\Omega_k$ | i cdrs
get through stage(k-1) of $\Omega_k$)
$R_k$ is also lower-triangular and gives the
transmission through stages (k-2) to 0 of $\Omega_k$. Then,

$$T_k = S_k R_k \qquad (4)$$

Looking at the recursive structure of the network
(fig.2) we can verify that

$$R_k(i,j) = \sum_{(i1,i2)} \sum_{(j1,j2)} \frac{\binom{2^{k-1}}{i1} \binom{2^{k-1}}{i2}}{\binom{2^k}{i}}\; T_{k-1}(i1,j1)\, T_{k-1}(i2,j2) \quad \text{if } j \le i$$
$$R_k(i,j) = 0 \quad \text{if } j > i \qquad (5)$$

Here the summations are over all distributions of
i into (i1,i2) and of j into (j1,j2) such that
$\max(0, i-2^{k-1}) \le i1, i2 \le \min(i, 2^{k-1})$, $j1 \le i1$, $j2 \le i2$.

Then, starting with $T_1 = I$, we can recursively
find $R_k$ from $T_{k-1}$, and then $T_k$ from $S_k$ and $R_k$,
until $T_n$ is determined for a $\Omega_n$ network.


Performance Parameters : Using the recursive model
we can determine several important performance
parameters associated with S/E networks.

The Bandwidth $B_n(i)$ of a NxN S/E network ($N=2^n$)
is defined as the expected number of cdrs
at the output, given that i cdrs entered the
network. Then,

$$B_n(i) = \sum_{j=0}^{N} j\, T_n(i,j) \qquad (6)$$

The Blocking Probability is the probability that a
request that enters as a cdr is blocked or misdirected
during its passage through the network because
of conflicts. It depends on i, the number of
cdrs at the input, and is given by :

$$PB_n(i) = \sum_{j=0}^{N} \frac{i-j}{i}\, T_n(i,j) = 1 - \frac{B_n(i)}{i} \qquad (7)$$


Define the load L = i/N. Figures 3 and 4 show
the variation of the Bandwidth and Blocking
Probability with respect to the Load. Plotting the results
with respect to L rather than i normalizes the plots
for the different size networks. For L<1/2, B
increases almost linearly because at low loads mdrs are
created mainly at the first stage and very few at
later stages, since the chance of 2 cdrs meeting
in a conflict at later stages is low. At higher
loads (L>1/2) there are more chances of 2 cdrs
conflicting in later stages and the bandwidth
increases at a slower rate. Fig. 3 also shows the
results obtained by simulation and the two values
are seen to agree closely. The plots also suggest
that the bandwidth increases at a faster rate with
increasing n. This is verified in fig. 5. Fig. 6
gives, for different size networks, the blocking
probability at each of the n stages rather than the
network as a whole. Its shape would serve to explain
the shape of fig. 5. Since blocking probability
falls more dramatically in later stages of larger
networks, the bandwidth increases at a faster rate
as the network size increases.


To determine the expected number of passes
necessary to realize an arbitrary permutation, we
define an absorbing Markov Chain (MC). The MC is in
state i if i requests still need to be routed to
their destinations. If $X_n$ is the state transition
matrix for the MC, then

$X_n(i,j)$ = P(j requests at the output still need
             to be routed | i requests at the
             input need to be routed)

$X_n(i,j) = T_n(i,i-j)$ for $j \le i$
$X_n(i,j) = 0$ for $j > i$

Also, $X_n(i,i) = T_n(i,0) = 0$ for all i>0, and
$X_n(0,0)=1$. Thus state 0 is the only absorbing
state, and all other states are transient and are
visited at most once. If #passes(i) is the expected
number of additional passes necessary if the
MC is in state i, then :

#passes(0) = 0,

$$\#passes(i) = 1 + \sum_{j=0}^{i-1} X_n(i,j)\, \#passes(j) \quad \text{for } i > 0$$

Fig. 7 shows the variation of
#passes(i) for different size networks at different
loads. The plot for L=1 gives the expected number
of passes to realize an arbitrary permutation. Fig.
8 shows that #passes(i)/n is more or less constant
for different network sizes, so that #passes(i) is
about (2/3)n for L=1.
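
Once a network transmission matrix is available, the bandwidth, the blocking probability and the expected number of passes follow mechanically. The sketch below assumes T is given as a square list of rows indexed as T[i][j]; the 3 x 3 matrix `T_toy` is hypothetical and is not produced by the recursive model (for a real 2 x 2 network the matrix is the identity). It is there only to exercise the formulas.

```python
from fractions import Fraction

def bandwidth(T, i):
    """Equation (6): expected number of cdrs leaving when i cdrs enter."""
    return sum(j * T[i][j] for j in range(len(T)))

def blocking_probability(T, i):
    """Equation (7): fraction of the i entering cdrs that are misdirected."""
    return 1 - Fraction(bandwidth(T, i)) / i

def expected_passes(T):
    """#passes(i) for every state i of the absorbing Markov chain."""
    size = len(T)
    passes = [Fraction(0)] * size
    for i in range(1, size):
        # X(i, j) = T(i, i - j): j requests remain after i - j were delivered.
        passes[i] = 1 + sum(T[i][i - j] * passes[j] for j in range(i))
    return passes

# Hypothetical matrix, NOT derived from the model; illustration only.
half = Fraction(1, 2)
T_toy = [[1, 0, 0],
         [0, 1, 0],
         [0, half, half]]
```

With `T_toy`, two entering cdrs yield a bandwidth of 3/2, a blocking probability of 1/4 and an expected 3/2 passes. Because states are visited in decreasing order, a single forward sweep over i is enough and no matrix inversion is needed.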


Conclusions : In this paper we presented a recurs- 
ive model for a self-routing S/E network and used 
it to determine several performance parameters. 
Simulation results were found to agree closely with 
the analytical results. Although the Omega network was
used here, the results are equally applicable to
all S/E networks. A more detailed treatment and
discussion will be found in [1]. Further, in [1],
it is shown that the class of Bit-Permute-Complement
permutations [8], which include many of the
permutations commonly encountered in parallel
algorithms, require only two passes through the
self-routing S/E network to be realized.


Abstract -- In this paper the SP2I
(Single-stage Plus $2^i$) interconnection
network, which is applicable to the CVCVHP
with VCM (Cellular Vector Computer of
Vertical-Horizontal Processing with Virtual
Common Memory) and other multiprocessor
systems, is discussed. Starting from the
need for dynamic and parallel data alignment,
we investigate various properties of
conflict-free routing and describe the iteration
method of automatic vector-routing,
which may be used to solve the conflict
problem in the SP2I network. Furthermore,
we extend the iteration method to the networks
ADM, $\Omega$, $\delta$, Indirect Binary n-Cube,
Baseline, etc., which are commonly seen in the
literature. The problem of routing
conflict in these networks, which has not
been well solved so far, may then be solved
efficiently. Finally, implementation
methods for several common data manipulation
functions without conflict are given.


Introduction


The background of this paper is the
problem of dynamic and parallel data alignment
between the vector processor and the virtual
common memory in the CVCVHP with VCM
(Cellular Vector Computer of Vertical-Horizontal
Processing with Virtual Common
Memory). We investigate various properties
of the SP2I network and solve the routing
conflict problem in SP2I and many other
networks.

The CVCVHP system [1] consists of N
cells. Suppose $N=2^n$. Every cell $C_j$ has a
processor $P_j$ and a memory bank $M_j$ with
capacity of $2^m$ words, j=0,1,...,N-1.
$\{M_j \mid j=0,1,\ldots,N-1\}$ constitutes the VCM. The
address $D = D' \cdot 2^n + D''$ points to the D'-th
element of $M_{D''}$; D' is the local address, D" is
the memory bank number or cell number. Let
J=(0,1,...,N-1); $J_j=j$ is a binary constant
number of n bits which is set up in $C_j$.

The SP2I network structure is used
and it is really half of the PM2I single-stage
network [10,11]. A routing register
$E_j$ is set up in $C_j$; $\{E_j \mid j=0,1,\ldots,N-1\}$
constitutes the vector routing register E.
The interconnection of the CVCVHP system
is just the interconnection among elements
of E. Taking the subscript j of an element of
E as variable, the SP2I network has n
interconnection functions [10,11] :
$P_i(j) = j+2^i \pmod N$, i=0,1,...,n-1; j=0,1,...,N-1.
Fig.1 shows the interconnection of the
j-th cell with others. Fig.2 shows the
information structure of $E_j$; these data
items all participate in routing.


[Figure 1: $E_{j-2^{n-1}}$ ... $E_{j-2^1}$ $E_{j-2^0}$ $E_j$ $E_{j+2^0}$ $E_{j+2^1}$ ... $E_{j+2^{n-1}}$]
Fig.1, The interconnection
of $E_j$ with others


[Figure 2: fields of $E_j$: data | local address | routing distance | valid bit]
Fig.2, The information structure of $E_j$


Routing Rules of SP2I network 


Let $x = x_{n-1} x_{n-2} \ldots x_1 x_0$ be the binary
notation of x, $(x)_i = x_i$, $L(k,x) = x_{k-1} x_{k-2} \ldots x_1 x_0$,
$H(k,x) = x_{n-1} x_{n-2} \ldots x_{n-k}$.

Suppose $C_j$ produces an address $D_j = D'_j \cdot 2^n + D''_j$.
In fetching operation, $D'_j$ must
be transmitted from $C_j$ into $C_{D''_j}$, then the
data $A_j$ can be fetched out from $M_{D''_j}$. The
fetched data $A_j$ must be transmitted from
$C_{D''_j}$ back into $C_j$, then $C_j$ can use the data
$A_j$. In storing operation, data and local
address $D'_j$ must be transmitted from $C_j$
into $C_{D''_j}$, then the data can be stored into
$M_{D''_j}$. This kind of transmission of data or
address from $C_j$ to $C_{D''_j}$ is called "routing",
the transmission of $A_j$ from $C_{D''_j}$ back to $C_j$
is called "return-routing". Both routing
and return-routing are executed in E. Let
$\delta_j$ be the routing distance, $\delta'_j$ be the
return-routing distance. $\delta_j = D''_j - j \pmod N$,
$\delta'_j = N - \delta_j \pmod N$, $0 \le j \le N-1$. Both $\delta_j$ and $\delta'_j$
are binary numbers of n bits.

In the network controlling structure,
routing is controlled by $\delta_j$, return-routing
is controlled by $\delta'_j$. When the Memory
Control Unit (MCU) sends out the command
of "routing $+2^i$", the routing rule is:

(a) $\varepsilon_j=0$, this shows that $E_j$ is invalid.
So $E_j$ has no effect on $E_{j+2^i}$. If the
information of $E_{j-2^i}$ must move to $E_j$, then
$E_j$ changes into new state and new value;
otherwise $E_j$ remains unchanged.

(b) $\varepsilon_j=1$, this means $E_j$ has valid
information. Routing result depends on $(\delta_j)_i$:

(i) $(\delta_j)_i=1$, the content of $E_j$ has
to move to $E_{j+2^i}$. If the content of $E_{j-2^i}$
must move to $E_j$, then $E_j$ changes into new
value; otherwise set $\varepsilon_j=0$.

(ii) $(\delta_j)_i=0$, this means $E_j$ has valid
information and does not need to move.
$E_j$ has no effect on $E_{j+2^i}$. If the content
of $E_{j-2^i}$ does not need to move to $E_j$, then
$E_j$ remains unchanged; otherwise $E_j$ changes
into new value, the old valid information
of $E_j$ is forced to disappear. This is
the routing conflict.
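
The routing rule can be sketched as a function on the vector routing register. The representation is an assumption: each register is either None (valid bit 0) or a (payload, distance) pair, and the function names are illustrative. A routing conflict is counted whenever an arriving item overwrites an item that did not need to move.

```python
def routing_step(regs, i):
    """Apply the command "routing +2**i" to the vector routing register.

    regs[j] is None when the valid bit of E_j is 0, otherwise a
    (payload, delta) pair.  Returns (new_regs, lost), where `lost`
    counts stationary items overwritten by an arriving item.
    """
    N = len(regs)
    new_regs = [None] * N
    lost = 0
    for j, item in enumerate(regs):             # items with bit i of delta = 0 stay put
        if item is not None and not (item[1] >> i) & 1:
            new_regs[j] = item
    for j, item in enumerate(regs):             # items with bit i of delta = 1 move
        if item is not None and (item[1] >> i) & 1:
            target = (j + (1 << i)) % N
            if new_regs[target] is not None:    # rule (b)(ii): routing conflict
                lost += 1
            new_regs[target] = item
    return new_regs, lost

def route_vector(deltas, order=None):
    """Route one item per cell; returns (final registers, lost items)."""
    N = len(deltas)
    n = N.bit_length() - 1
    regs = [(j, d) for j, d in enumerate(deltas)]
    lost = 0
    for i in (order if order is not None else range(n)):
        regs, k = routing_step(regs, i)
        lost += k
    return regs, lost
```

With N = 8, an address vector of stride 3 starting at address 5 reaches its eight banks without loss, whereas a stride of 2 sends two items to each of four banks and loses items on the way.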


The procedure of fetching a vector in parallel is:

(a) Compute the address vector D in
all cells, then resolve D into D' and D".
Compute $\delta = D'' - J \pmod N$. Send D', $\delta$ and the
corresponding valid bit vector $\varepsilon$ into E
(i.e., $D' \to ED'$, $\delta \to E\delta$, $\varepsilon \to E\varepsilon$, where $ED'$
stands for D' of E, $E\delta$ for $\delta$ of E, $E\varepsilon$ for
$\varepsilon$ of E. This notation will be used below
not only for E but also for F and RI).

(b) Perform n steps of routing $+2^0, +2^1, \ldots, +2^{n-1}$.

(c) Fetch out the corresponding data vector
according to $ED'$ under the control of
$E\varepsilon$ (i.e., every cell which has a valid local
address fetches a datum from its private
memory bank). Then the fetched data vector
is sent into EA. Compute $\delta' = N - E\delta$. $\delta' \to E\delta$.

(d) Perform n steps of return-routing $+2^0, +2^1, \ldots, +2^{n-1}$.

(e) EA is sent into B under the control of $E\varepsilon$
(i.e., if $E\varepsilon_j=1$, then $EA_j \to B_j$;
otherwise $E\varepsilon_j=0$, and $EA_j$ is not sent to $B_j$), where B is
the vector buffer register for lookahead
fetching. Then the data vector is in B and
ready for use.

The procedure of storing a vector in parallel
is similar to that of the fetching
operation, but the storing operation of a
vector without conflict does not need the
return-routing.

The SP2I network may also offer another
kind of routing command: "broadcast-type
routing $+2^i$". The only difference between
broadcast-type routing $+2^i$ and ordinary
routing $+2^i$ is: when the content of $E_j$ moves to
$E_{j+2^i}$ and the content of $E_{j-2^i}$ does not
move to $E_j$, broadcast-type routing makes $E_j$ retain
its original state and value, but ordinary routing
makes $E_j$ become "empty".


Properties of Conflict-free Routing 
For conflict-free vector routing, only
2n steps of routing treatment are needed
for fetching a vector, n steps for storing
a vector.


Theorem 1: For an address vector of N valid
elements, n steps of routing treatment
with arbitrary order have no conflict if
and only if $\delta$ satisfies the condition
(*), i.e., for any integers i,j ($0 \le j < N$,
$0 \le i < n$),

$$\delta_{j+2^i} \equiv \delta_j \pmod{2^{i+1}} \qquad (*)$$

Proof: Please see paper [2].

For a $\Delta$-ordered vector Y whose address
vector is $\mathbf{D} = (D, D+\Delta, D+2\Delta, D+3\Delta, \ldots)$,
where D is the initial address and the constant $\Delta$ is
the address increment, if $K \le N/\gcd(\Delta,N)$, then
any K consecutive addresses of $\mathbf{D}$ point to
K different memory banks, where $\gcd(\Delta,N)$
is the greatest common divisor of $\Delta$ and N.
So, if $\Delta$ is odd, then $\gcd(\Delta,N)=1$, and therefore
any N consecutive addresses of $\mathbf{D}$ point to
N different memory banks. Generally speaking,
pointing to N different memory banks
is not sufficient to ensure conflict-free
routing. But for odd $\Delta$, according
to Theorem 1, we have Corollary 1.

Corollary 1: The $\Delta$-ordered address
vector $\mathbf{D}$ of N valid elements with odd $\Delta$ is
conflict-free in routing.

Proof: $D_j = D + j\Delta$, $\delta_j = D_j - j = D + j(\Delta-1) \pmod N$,
$\delta_{j+2^i} - \delta_j = D + (j+2^i)(\Delta-1) - (D + j(\Delta-1)) = 2^i(\Delta-1)$.
For odd $\Delta$, $\Delta-1$ is even, so
$\delta_{j+2^i} - \delta_j = 2^i(\Delta-1) = 2^{i+1} \cdot \frac{\Delta-1}{2} \equiv 0 \pmod{2^{i+1}}$,
i.e., $\delta$ satisfies the
condition (*). According to Theorem 1, the
Corollary is true.
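
Condition (*) is easy to test numerically. The sketch below assumes that the subscript j + 2^i is taken modulo N, and both function names are illustrative; it evaluates the condition for the routing distances of a vector whose addresses advance by a fixed increment.

```python
def satisfies_star(deltas):
    """Condition (*): delta[j + 2**i] == delta[j] (mod 2**(i+1)) for all i, j."""
    N = len(deltas)
    n = N.bit_length() - 1
    return all((deltas[(j + (1 << i)) % N] - deltas[j]) % (1 << (i + 1)) == 0
               for i in range(n) for j in range(N))

def ordered_deltas(D, step, N):
    """Routing distances of the address vector (D, D+step, D+2*step, ...)."""
    return [(D + j * step - j) % N for j in range(N)]
```

Every odd increment passes the check for any initial address, in line with Corollary 1, while an increment of 2 already fails it at i = 0.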

In practice, the MCU issues n steps of
routing command usually in a certain order,
such as $+2^0, +2^1, \ldots, +2^{n-1}$ or $+2^{n-1}, +2^{n-2}, \ldots, +2^0$.

Lemma 1: Suppose data A starts moving from
$E_j$, routing distance is $\delta$, routing command
order is $+2^0, +2^1, \ldots, +2^{n-1}$. After k steps of
routing ($1 \le k \le n$), A arrives at $E_{j+L(k,\delta)}$.

Proof: k steps of routing cause A
to have moved a distance $L(k,\delta)$, so A must
arrive at $E_{j+L(k,\delta)}$.

Lemma 2: Suppose data A starts moving from
$E_j$, routing distance is $\delta$, routing command
order is $+2^{n-1}, +2^{n-2}, \ldots, +2^0$. After k steps
of routing ($1 \le k \le n$), A arrives at $E_{j+2^{n-k} \cdot H(k,\delta)}$.

Proof: k steps of routing cause A to
have moved a distance $2^{n-k} \cdot H(k,\delta)$, so A
must arrive at $E_{j+2^{n-k} \cdot H(k,\delta)}$.
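
Lemmas 1 and 2 can be confirmed by stepping a single item through the routing commands. The sketch assumes the item is never overwritten on the way and passes the word length n to H explicitly; `position_after` is an illustrative name.

```python
def L(k, x):
    """Low-order k bits of x, read as an integer."""
    return x & ((1 << k) - 1)

def H(k, x, n):
    """High-order k bits of the n-bit number x, read as an integer."""
    return x >> (n - k)

def position_after(j, delta, order, n):
    """Register reached from E_j after the routing steps listed in `order`."""
    N = 1 << n
    for i in order:
        if (delta >> i) & 1:
            j = (j + (1 << i)) % N
    return j
```

After k ascending steps the displacement is L(k, delta), and after k descending steps it is 2^(n-k) times H(k, delta, n), both taken modulo N.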

Theorem 2: E has N valid elements, routing
command order is $+2^0, +2^1, \ldots, +2^{n-1}$; then
there will be no routing conflict if and
only if for any integers i,j,k ($0 \le i < N$, $0 \le j < N$,
$1 \le k \le n$), if $i \ne j$ then $i + L(k,\delta_i) \not\equiv j + L(k,\delta_j) \pmod N$.

Proof: Conflict-free routing means for
any $E_i$, $E_j$ ($0 \le i < N$, $0 \le j < N$, $i \ne j$), after any k
($1 \le k \le n$) steps of routing, they do not arrive
at the same routing register. Based on Lemma 1,
it is obvious that Theorem 2 is true.

Theorem 3: E has N valid elements, routing
command order is $+2^{n-1}, +2^{n-2}, \ldots, +2^0$.
Then there will be no routing conflict
if and only if for any integers i,j,k
($0 \le i < N$, $0 \le j < N$, $1 \le k \le n$), if $i \ne j$ then
$i + 2^{n-k} \cdot H(k,\delta_i) \not\equiv j + 2^{n-k} \cdot H(k,\delta_j) \pmod N$.

Proof: It follows from Lemma 2 and
the concept of conflict-free routing.
Lemma 3: Suppose x,y move from E’ Pi to 
By Ey respectively.If (i) in<dorixtsy » 
(ii ) Som ozi nig s (414 ) db, > G» (iv) routing 
command order is 429,42) eee 4.g0n! then 
x and y are conflict-—free. 

Proof: When $\delta_x = \delta_y$, x and y either
both move or both do not move at the same
time. This is called "parallel moving".
Obviously, in this case x and y are conflict-free.
Below we consider the case of


j+20-ku(k,d) ° 


| d, >a. Without losing generality, suppose 


gah, 1D, pee ey y IU ye 0 ey (1) 
(2) 


By the theorem conditions, it is easy 
to know that either ((i,919)A(jpzjq))=1 or 
((ip<dg ACSp<dg) )=1, where A is logical 
AND. If ((ig>in)A(Jy2Jq))=1, then 
(Jo~4g)—(Jgn~ig)=(dgntg)-(Ig-dg)=h- 
if ((ig<dg)A(I<Jg))=1, then 
(Jondg)-(ig-dy) = (N-ig tty) -(N+5p%J9)= J- d, ° 


Anyway, we obtain 


(Jgmtp)-(Sgndg = d-dh (3) 
From (1) and (2), we get 
d,-K=L(p+1 ,d,)-L(p+t ,d,) (4) 


From (3), we get 
Jomtg=dg-igt - G > EF (5) 
If there exists an integer k (1<k<n), 
and a routing conflict occurs at the k-th 
step of routing, then from the Theorem 2, 
we have 
ipth(k,d_)=Jgt+h(k,d_) (mod N) (6) 
Jomtp=h(ky dG, )-L(k,d,) (mod N) (7) 


(a) If k2pti, from (7), (1),(2),(4), 
we obtain Jo~ip=d,- This is contradicted 
by the formula (5).

(o) If kgp 

(1) When L(k,d_) 2L(k,d,), then (7) 
becomes 

Jgnmip=L(k, Ff )-L(k,d ) 
Substitute (1) and (2) for 
(5), respectively, we get 
Jomtg=(Jg-dg)+(1u,_, eeem,-Ou'_, +e out) oat 

+L (kph, )-L (ie, F,) >L (ie, J, )-L(ie, J) 
This contradicts the formula (8). 

(44) When L(kyd_)<L(k,d, ),from (6), 
since i<j » 30 ((Jo+L(ksd5) > MN) A (Ag+ 
L(k,d,)<N))=1. This must be the case of 
((Ir<dg )ACigsig) )=1. Thus (6) must be 


(8) 


cd and J, in 


x 


Toth (ky F )ajoth (key d, )-N, then N-Jjotiy 
+L (ky A, )=b(k, ocak » 30 
Lath (kyo <2 (9) 


From By pth(iky d_) to B, , x has to cover 


the distance of N-(ij+L(k,d_))+i, . From 
(9) pU=(tg+h(ikyd,))+1g>N-2*+4g9N-2" (10) 


On the other hand, according to (1), 
the distance from E yt h(i, d,) to 5, is 


H(n-k,d_)2", But H(n-k,d)+2%n-2%, this 
contradicts the formula (10). 

By all the above contradictions, Lemma
3 must be true. 


Theorem 4: Suppose $x_k$ moves from $E_{i_k}$ to
$E_{j_k}$, the routing distance of $x_k$ is $\delta_k$
(k=1,2,...,K). If for k=1,2,...,K-1,
(i) $0 \le i_k < i_{k+1} < N$, $0 \le j_k < j_{k+1} < N$,


(it) neg? Jx41 75x 
ee 1? tit 


(iv) routing command order is
$+2^0, +2^1, \ldots, +2^{n-1}$,
then for S={$x_k$ | k=1,2,...,K}, n steps of
routing have no conflict.
Proof: For any p,q ($1 \le p < q \le K$), by Lemma
3, $x_p$ and $x_q$ are conflict-free, so is S.

Lemma 4: Suppose x,y move from Pi Fe to 
Bieeh. Tenpeesavelys If (i) 1o<io» ip<iys 
(44) Jg-igsdig-ty » (di) dD, <d ,(iv) rout- 
ing command order is +20-1,42N-2 202.420) 
then x and y are conflict-free. 

Proof: The proof is similar to that 
of Lemma 3, 

Theorem 5: Suppose $x_k$ moves from $E_{i_k}$ to
$E_{j_k}$, the routing distance of $x_k$ is $\delta_k$
(k=1,2,...,K). If for k=1,2,...,K-1,
(i) $0 \le i_k < i_{k+1} < N$, $0 \le j_k < j_{k+1} < N$,
(441) Tithe < Sic 1 Sk 
(441) Js d, 


k+1 
(iv) routing command order is
$+2^{n-1}, +2^{n-2}, \ldots, +2^0$,
then for S={$x_k$ | k=1,2,...,K}, n steps of
routing have no conflict.

Proof: The proof is similar to that
of Theorem 4.

Theorems 6, 7, and 8 describe the ability
of the SP2I network to simulate the fundamental
functions of other interconnection
networks. The proofs are simple and are
omitted here.

Vector $Y = (Y_0, Y_1, \ldots, Y_{K-1})$ is called
canonical if and only if for all k (k=0,1,
...,K-1), $Y_k$ is in $C_k$, where $K \le N$. Suppose
$p = p_{n-1} p_{n-2} \ldots p_1 p_0$ is the binary notation
of the order number p of cell $C_p$.

Indirect Binary n-Cube network has n
interconnection functions [10,14] :

$Cube_k(J) = (C_k(0), C_k(1), \ldots, C_k(N-1))$

where $C_k(p) = p_{n-1} p_{n-2} \ldots p_{k+1} \bar{p}_k p_{k-1} \ldots p_1 p_0$;
p=0,1,...,N-1; k=0,1,...,n-1.

The basic interconnecting structure
of $\Omega$, $\delta$, Baseline, etc., is Shuffle-Exchange.
The Shuffle function and Exchange
function are two principal interconnecting
functions [10,11] :

$Shuffle(J) = (S(0), S(1), \ldots, S(N-1))$

$Exchange(J) = (E(0), E(1), \ldots, E(N-1))$

where $S(p) = p_{n-2} p_{n-3} \ldots p_1 p_0 p_{n-1}$,
$E(p) = p_{n-1} p_{n-2} \ldots p_1 \bar{p}_0$, p=0,1,...,N-1. If
one step of routing transmits a canonical
vector Y into its destination with the
address vector F(J), then the network is
said to have realized the F(J) function.

Theorem 6: n-k steps of routing $+2^{j_1}, +2^{j_2}, \ldots, +2^{j_{n-k}}$
can realize the $Cube_k(J)$
function, where $(j_1, j_2, \ldots, j_{n-k})$ is any
permutation of (k,k+1,...,n-1).

Theorem 7: The n steps of routing with arbitrary
order can realize the Exchange(J)
function.

Theorem 8: For a canonical vector of length
N, n steps of routing with any command
order cannot realize the Shuffle(J) function,
but (2n-1) steps can.
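
Theorems 6 and 7 can be checked for small networks by simulation. The sketch below assumes that the cube function complements bit k of the cell number, that each item's routing distance is its destination minus its source modulo N, and that a conflict is an arriving item landing on one that stays. The names are illustrative.

```python
from itertools import permutations

def cube(k, p):
    """C_k(p): complement bit k of p."""
    return p ^ (1 << k)

def realizes_cube(k, n, order):
    """True if routing steps +2**i, i in `order`, send a canonical vector
    of length N = 2**n to Cube_k(J) without any routing conflict."""
    N = 1 << n
    regs = {p: (p, (cube(k, p) - p) % N) for p in range(N)}   # position -> (source, delta)
    for i in order:
        stay = {pos: it for pos, it in regs.items() if not (it[1] >> i) & 1}
        move = {(pos + (1 << i)) % N: it for pos, it in regs.items() if (it[1] >> i) & 1}
        if stay.keys() & move.keys():        # a mover would overwrite a stayer
            return False
        regs = {**stay, **move}
    return all(cube(k, src) == pos for pos, (src, _) in regs.items())
```

For n = 3 every ordering of the steps k, ..., n-1 realizes the cube function on bit k, and the case k = 0 is the Exchange function of Theorem 7.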


Iteration Method of Automatic Vector-Routing

Some vectors have routing conflict,
such as a $\Delta$-ordered vector with even $\Delta$. It
has memory bank conflict, and therefore has
transmission route conflict. Even if the
N elements of an address vector point to N
different memory banks, there may still
exist routing conflict. For
example, the bit reverse transmission of a
canonical vector Y has no memory bank conflict,
but has transmission route conflict,
where $Y = (Y_0, Y_1, \ldots, Y_{N-1})$; the bit reverse
transmission means transmitting $Y_p$ from $E_p$
to $E_{p'}$, $p = p_{n-1} p_{n-2} \ldots p_1 p_0$, $p' = p_0 p_1 \ldots p_{n-2} p_{n-1}$,
p=0,1,...,N-1.

In order to realize dynamic and parallel
data alignment, a network must solve
the conflict problem. The iteration method
of automatic vector-routing which was presented
in paper [2] is an efficient method
to solve the routing conflict problem for
the SP2I network. The basic idea is as follows:

In addition to E, we set up another
vector register $F = (F_0, F_1, \ldots, F_{N-1})$ for
information reservation. $F_j$ is in $C_j$. The
information structure of $F_j$ is the same as
that of $E_j$ (see Fig.2).

The whole procedure of accessing a
vector consists of several routing iterations.
Each iteration consists of n steps
of routing and n steps of return-routing.

First, the information to be transmitted
is sent into F. Then perform routing
iteration.

(i) $F \to E$.

(ii) E executes n steps of routing
and n steps of return-routing. If conflict
occurs, we allow some valid information to
disappear from E.

(iii) If $E\varepsilon_j=1$, then $EA_j \to B_j$, $0 \to F\varepsilon_j$
(i.e., $\varepsilon_j$ of $F_j$); otherwise $B_j$ and $F\varepsilon_j$
remain unchanged (j=0,1,...,N-1).

(iv) If $F\varepsilon \ne (0,0,\ldots,0)$, then go to
(i); otherwise the fetching operation is
finished.
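
Steps (i) to (iv) can be put together in a short simulation of an iterated fetch. This is a sketch under assumptions: registers are modelled as Python dictionaries, `memory[bank][local]` stands for the private memory banks, the command order is ascending for both routing and return-routing, and an arriving item overwrites one that stays. All names are illustrative.

```python
def n_steps(regs, n):
    """n routing steps +2**0 ... +2**(n-1); an arriving item overwrites a stayer."""
    N = 1 << n
    for i in range(n):
        nxt = [None] * N
        for j, it in enumerate(regs):
            if it is not None and not (it["dist"] >> i) & 1:
                nxt[j] = it
        for j, it in enumerate(regs):
            if it is not None and (it["dist"] >> i) & 1:
                nxt[(j + (1 << i)) % N] = it
        regs = nxt
    return regs

def iterative_fetch(addresses, memory, n):
    """Fetch memory[bank][local] for every cell's address; returns (B, iterations)."""
    N = 1 << n
    F = [True] * N                       # valid bits of the reservation register F
    B = [None] * N                       # vector buffer register
    iterations = 0
    while any(F):
        iterations += 1
        # (i) F -> E: reload every request that has not been served yet.
        E = [{"local": addresses[j] >> n, "dist": (addresses[j] - j) % N}
             if F[j] else None for j in range(N)]
        # (ii) routing, fetch at the bank, then return-routing.
        E = n_steps(E, n)
        for bank, it in enumerate(E):
            if it is not None:
                it["data"] = memory[bank][it["local"]]
                it["dist"] = (N - it["dist"]) % N
        E = n_steps(E, n)
        # (iii) survivors deliver their data and clear their bit in F.
        for j, it in enumerate(E):
            if it is not None:
                B[j] = it["data"]
                F[j] = False
    return B, iterations
```

The loop always terminates because the item that overwrites another survives that step, so at least one request is served per iteration. With N = 8 a stride of 3 starting at address 0 completes in one iteration, while a stride of 2 needs several because two requests compete for each bank.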

The iteration method of automatic
vector-routing may be extended to many
multi-stage or single-stage interconnection
networks to solve their routing conflict
problems. For example, the ADM [3] (Augmented
Data Manipulator) network has 2n interconnecting
functions $F_i(j) = j \pm 2^i$, j=0,1,...,N-1;
i=0,1,...,n-1. Its controlling structure
also uses the routing distance as
controlling tag. Thus, the above iteration
method may be directly used in the ADM
network without any modification. When
rerouting is used in the ADM network, the
iteration number needed for the ADM network is
usually less than that needed for SP2I.

The iteration method may also be extended
to such multi-stage Shuffle-Exchange-type
networks as Indirect Binary n-Cube,
$\Omega$, $\delta$, Baseline, etc. [8,9]. In their
controlling structure the destination
address $D_j$ is used as routing tag of the
source address $S_j$ ($j=0,1,\ldots,N-1$). In the
i-th stage, if $(D_j)_i=1$ then the data is
switched into the lower output of the switch
element; otherwise into the higher output.
When extending the iteration method to
these networks to solve their conflict
problems, the information to be transmitted
consists of four parts as shown in Fig.3.


| data | source address (n bits) | destination address (m+n bits) | valid bit (1 bit) |

Fig.3, Information structure of $RI_j$ and $F_j$

We set up an information input vector
register RI and an information reservation
vector register F. $RI_j$ and $F_j$ are in $C_j$.
Their information structure is shown in
Fig.3. Suppose the CVCVHP system uses the
above Shuffle-Exchange-type multi-stage
interconnection network to realize dynamic
and parallel data alignment. We will take
the fetching operation as an example to
describe the extended iteration method.

Suppose the MCU is going to fetch a
vector Y of N elements from VCM and send
it into vector buffer register B, denoted
as Y=>B. At first, compute Y's address vector
$D=(D_0,D_1,\ldots,D_{N-1})$ where $D_j=D'_j\cdot 2^n+D''_j$;
$j \to FS_j$, $D_j \to FD_j$, $1 \to FE_j$. Then perform the routing
iteration.

(i) Routing: $F \to RI$. $RI \to$ network. $D''_j$
of $RI_j$ is used as routing tag. If at some
k-th stage, a switch element has two valid
inputs with source addresses i and j, and
$(D''_i)_k=(D''_j)_k$, then the two inputs
have to leave through the same higher (or
lower) output port of the switch element,
causing a conflict. In this case, we may
choose only one input (such as always the
higher input) to output and let the other
disappear (i.e., set its valid bit to
zero). Thus for this switch element,
one output port sends out valid information
and the other output port sends out invalid
information. Finally the output of the
network returns to RI.

Then perform a fetching operation in
parallel, i.e., if $RIE_j=1$, then fetch a
datum from $M_j$ according to local address
$RID'_j$ and send it into $RIA_j$; otherwise
$RIE_j=0$, then $RIA_j$ remains unchanged, $j=0,
1,\ldots,N-1$.

(ii) Return-routing: $RI \to$ network.
This time, S is used as destination address
vector, and $S_j$ becomes the routing tag.
The transmission conflict is treated in
the same way as in (i). The output of the
network returns to RI.

If $RIE_j=1$, then $RIA_j \to B_j$, $0 \to FE_j$;
otherwise $RIE_j=0$, then $B_j$ and $FE_j$ remain
unchanged.

At last, examine FE. If $FE \neq (0,0,\ldots,0)$,
then go to (i); otherwise the fetching
operation is finished.

The procedure for storing operation 
is like that for fetching operation. But 
conflict-free storing operation does not 
need the return-routing. 

If in the Indirect Binary n-Cube, $\Omega$, $\delta$,
or Baseline network, we only consider the
permutation on the set of processor
addresses (i.e. memory bank addresses), then
the destination address $D_j$ only needs n
bits. Data vector A moves from source address
vector S=J into vector register B
according to destination address vector D,
which is just like a storing operation.
We can use the scheme described above to
completely realize the permutation operation.


The iteration method is also applicable
to other types of single-stage interconnection
networks. Using the method enables
users to access data vectors of the VCM
without very carefully considering the various
vector types. Vector access can be
realized automatically by machine. Conflict-free
vector access only needs one
routing iteration. The problem which is
worth further studying is how to decrease
the iteration number. For example, for a
$\Delta$-ordered vector Y of N elements, if $\Delta=q\cdot 2^p$,
q is odd, then the N elements distribute
in $N/2^p$ memory banks. Every memory bank
contains $2^p$ elements of Y. Storing or fetching
Y needs at least $2^p$ accesses to the
VCM. In the SP2I network, if the routing command
order is always $+2^0,+2^1,\ldots,+2^{n-1}$ (both for
routing and for return-routing), then the
iteration number will be far greater than
$2^p$. But if we use the order $+2^{n-1},+2^{n-2},\ldots,+2^0$
for routing and the order $+2^0,+2^1,\ldots,+2^{n-1}$
for return-routing, then the iteration
number will reach the lowest bound $2^p$.
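
The bank distribution claimed above is easy to check numerically. The following sketch assumes that the memory bank of an address is the address modulo N, consistent with the decomposition of an address into a local part and an n-bit bank part; the function name `bank_load` and the sample values are illustrative only.

```python
from collections import Counter

def bank_load(start, delta, n_elems, n_banks):
    """Count how many elements of a delta-ordered vector fall in each bank.

    Element j lives at address start + j*delta; the bank is taken to be
    address mod n_banks (an assumption, see the lead-in).
    """
    return Counter((start + j * delta) % n_banks for j in range(n_elems))

N = 16                      # N = 2^n banks, n = 4
q, p = 3, 2                 # delta = q * 2^p with q odd
load = bank_load(start=5, delta=q * 2**p, n_elems=N, n_banks=N)
print(len(load), set(load.values()))   # number of banks used, elements per bank
```

With these values the vector occupies N/2^p = 4 banks holding 2^p = 4 elements each, which is why at least 2^p accesses are needed.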


Conflict-free Implementation of
Data Manipulation Functions


Conflict-free routing is one of the
most important problems of interconnection
networks and has been well discussed in a
number of papers [5,6,8,9, etc.]. Based on
the properties of conflict-free routing
discussed formerly, we will give some
implementation methods in the SP2I network
for frequently used data manipulation
functions.

When the address vector is $\mathbf{D}=D+J\cdot\Delta$ and
$\Delta\equiv 1 \pmod 2$, D and $\Delta$ are broadcast into
all cells to compute $\mathbf{D}$. Then only one routing
iteration is needed for realizing the access.

Suppose $Y=(Y_0,Y_1,\ldots,Y_{K-1})$, $Y'=(Y_{K-1},
\ldots,Y_1,Y_0)$. Then Y' is called the reverse
vector of Y. If Y is a $\Delta$-ordered vector
with odd $\Delta$, then Y' is also a
$\Delta_1$-ordered vector with odd $\Delta_1$, where
$\Delta_1=2^{m+n}-\Delta \pmod{2^{m+n}}$, and the initial address
of Y' is $D+(K-1)\Delta \pmod{2^{m+n}}$. Accessing
Y' is conflict-free. When $K\le N$, and Y is
canonical, we can get Y' through the
following process:

(i) Compute the bit vector $\alpha=(\alpha_0,\alpha_1,
\ldots,\alpha_{N-1})$, where $\alpha_j=1$ (if $j<K$) or $\alpha_j=0$ (if
$j\ge K$). This will be denoted as $\alpha=[J<K]$.
Compute $\zeta=K-1-2J \pmod N$.

(ii) $Y \to EA$, $\zeta \to ED$, $\alpha \to EE$. Perform n steps
of routing. $EA \to B$. Then $(B_0,B_1,\ldots,B_{K-1})=Y'$.
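
The routing distance $\zeta_j=K-1-2j \pmod N$ sends the element in cell j to cell K-1-j. A minimal check of that arithmetic follows; it only verifies where each element lands, and does not simulate the n routing steps or their conflict-freeness. The function name is hypothetical.

```python
def reverse_by_routing(y, n_cells):
    """Reverse the K elements held in cells 0..K-1 of an N-cell ring.

    Cell j sends its element a distance zeta_j = (K - 1 - 2*j) mod N to the
    right, so it lands in cell (K - 1 - j) mod N.
    """
    k = len(y)
    assert k <= n_cells
    out = [None] * n_cells
    for j, value in enumerate(y):
        zeta = (k - 1 - 2 * j) % n_cells      # routing distance tag
        dest = (j + zeta) % n_cells
        assert out[dest] is None              # no two elements share a cell
        out[dest] = value
    return out[:k]
```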
2. Matrix Transposition 

Suppose $A=(a_{ij})_{K\times K}$, K is odd, and the
storage pattern makes every row vector a
1-ordered vector and every column vector a
K-ordered vector. We want to get $B=A'=
(b_{ij})_{K\times K}$, where $b_{ij}=a_{ji}$; the storage
pattern of B will be the same as that of A.

(a) Suppose A and B occupy
different memory space. Matrix transposition
is to fetch a 1-ordered vector A[i,*] and
to store it into the space of a K-ordered
vector B[*,i], $i=0,1,\ldots,K-1$, i.e., to
perform the following program [1]:

O=>1 5 3 (RsA(1,*] 3H, ;K, =B[*,1]s Tsirists | 

(b) Suppose B will occupy the space
of A. $R_1$ and $R_2$ are used as two vector
working registers, and the address of $a_{ii}$ is
used as initial address for both the i-th
row vector and the i-th column vector to
be fetched and stored. Perform the program:
KaK' s0543(y 4s (gr sA(i,*] of, sA[*, 1] 9Rp3 
FaA[i,*] pH, oA(*,i]s J’ sittoisk'-19K"s | 
where the initial address of A[i,*] and
A[*,i] is $D_i=D+iK+i$, and D is the initial
address of A.

(c) The case of $A=(a_{ij})_{N\times N}$. As shown
in Fig.4, transposition is done by exchanging
$\alpha_i$ and $\beta_i$, which are the higher and
lower subdiagonal vectors with the diagonal
vector $\alpha_0=\beta_0$ as the axis of symmetry.

If the global address of $a_{00}$ is
$D'\cdot 2^n$, then $\alpha_0$ is canonical and has local
address vector $\mathbf{D}'=D'+J$. The corresponding
elements of $\alpha_i$ and $\alpha_{i+1}$ have the same local
address, but their memory bank addresses
are different by 1. On the contrary, the
corresponding elements of $\beta_i$ and $\beta_{i+1}$ have
the same memory bank address, but their local
addresses are different by 1. So we can
adopt the following transposition process:
(i)Broadcast D’, D' +598, and 6p. 14. 
(ii) [Tra] +04, [Fut] +h. 
(414) teas ED. "Parallelly routing 
42°") EDta Be 6, » Fetch duu a, (whose local 
address vector is just in py *) and send it 
into EA."Parallelly routing +(W-i)". EAaH, . 
(iv) BottB a. Fetch out By (its leal 
address vector is just in ,) and Send it 
into EA. “Parallelly routing +i". EAaR, . 


(v) Under the control of re 
Under the control of a, ‘ i$ 5(B, )» 

(vi) $i+1 \to i$. If $i\ge N$, then end;
otherwise go to (ii).


Fig.4, Transposition of $A=(a_{ij})_{N\times N}$
3. Compressing and Spreading 


Compressing is to save memory space.
Only non-zero elements of a vector are
stored into memory. Suppose $\beta=(\beta_0,\beta_1,\ldots,
\beta_{K-1})$ consists of the subscripts of the
non-zero elements of $Y=(Y_0,Y_1,\ldots,Y_{N-1})$, $K\le N$,
$0\le\beta_0<\beta_1<\cdots<\beta_{K-1}\le N-1$. The compressed vector
is denoted as $[\beta,Y]=(Y_{\beta_0},Y_{\beta_1},\ldots,Y_{\beta_{K-1}})$.

Spreading is the inverse operation of
compressing; a spreading vector (sparse vector)
is denoted as $(\beta,Z)=(0,\ldots,0,Z_0,0,\ldots,0,
Z_1,0,\ldots,0,Z_{K-1},0,\ldots,0)$.
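
As a plain restatement of the two definitions (not of the routing procedure that implements them on the network), the operations can be written as below. The function names are hypothetical, and the positions of the values in the spread vector are taken to be given by the subscript vector.

```python
def compress(y):
    """Return (beta, [beta, Y]): subscripts and values of the non-zero elements."""
    beta = [j for j, v in enumerate(y) if v != 0]
    return beta, [y[j] for j in beta]

def spread(beta, z, n):
    """Inverse operation: place z[k] at position beta[k] of an n-vector of zeros."""
    out = [0] * n
    for pos, v in zip(beta, z):
        out[pos] = v
    return out
```

Spreading a compressed vector with its own subscript vector gives back the original vector.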

Suppose $\beta$, Y, $[\beta,Y]$ are all canonical.
Because the SP2I network only has the
"routing $+2^i$" function, i.e., the right-routing
function, compressing to the left is implemented
by right-round-routing. If for $i=0,1,
\ldots,t-1$, $\beta_i=i$, and for $k\ge t$, $\beta_k>k$, this implies that
$Y_0,Y_1,\ldots,Y_{t-1}$ are non-zero elements; they
do not need to move in compressing. But
the moving of $Y_{\beta_k}$ ($k\ge t$) may cause the $Y_i$
($0\le i\le t-1$) to disappear. To avoid this conflict,
we first compress Y to the right end,
then transmit the elements to the left end in parallel.
As shown in Fig.5, to compress Y to the right
end satisfies the conditions of Theorem 4.
On the command order of $+2^0,+2^1,\ldots,+2^{n-1}$
there is no routing conflict. Below is the
compressing procedure:

Fig.5, Compress routing

(a) Broadcast N-K. [icx]oo. Compute 

p-J J2d**, N-K-9 ay g™ 

(o) S*aRK Br+0Bb of vortorn n steps 
of routing +2°7',+29-? this is 
conflict-free according to Theorem 5 EAE. 

(c) $Y \to EA$. Perform n steps of routing
$+2^0,+2^1,\ldots,+2^{n-1}$.

(d) "Parallelly routing +K". $EA \to B$.
Then $(B_0,B_1,\ldots,B_{K-1})=[\beta,Y]$.


If we want to store $[\beta,Y]$ into the
memory space $D,D+1,D+2,\ldots$, where the
initial address $D=D'\cdot 2^n+D''$, as shown in
Fig.6, there are two cases. We only need to
modify the (d) of the above procedure as
follows:

(After the process of (c))
"Parallelly routing +(D''+K)". $D' \to ED'$,
$[J<D''] \to \alpha$. Under the control of $\alpha$, $D'+1 \to ED'$.
Then under the control of EE, $EA \to (ED')$.

Fig.6, Compressed into memory: case (1) $D''\le N-K$, case (2) $D''>N-K$

The implementation process of the spreading
vector $(\beta,Z)$ may be:
Bid. 


(a) OB, [SK Joo. 

(b) ZonA F288, oe. Perform n steps 
of routing 420-1 oe 2 ose 42", Under the 
control of EE, EAsB. Then B=(@,Z). 


4. $\sqrt N\times\sqrt N$-Square-Block Vector,
Fanout, Replicate

Here we suppose $N=2^n=2^{2t}$, $\sqrt N=2^t=s_1$.
In order to thoroughly utilize the ability
of parallel computation, in some cases, we
can cut a matrix A into many $\sqrt N\times\sqrt N$-square-blocks,
and every block as a "block vector"
may be computed in parallel by N processors.

Suppose $A=(a_{ij})_{K\times K}$, $\lceil K/s_1\rceil=c$. Let
q=c (if c is odd) or c+1 (if c is even),
and $\Delta=q\cdot s_1$. A is stored into memory in the
following storage pattern: for $i=0,1,\ldots,
K-1$, the address vector of the data vector
$(a_{i0},a_{i1},\ldots,a_{i,K-1},0,\ldots,0)$ is $(D+i\Delta,
D+i\Delta+1,\ldots,D+i\Delta+K-1,D+i\Delta+K,\ldots,D+i\Delta+\Delta-1)$,
where D is the address of $a_{00}$; D may be
arbitrary. Then the square block vector $K_{ij}$
is defined as C25 gage 984, 545, 1 5 


Patty F441, 541 ae a nea 


rave. yt silane ieee Sits q7'sj+8,- _1)s 
whose address wctor ss jtH(t,3) oA 
+L(t,d), where H(t Deine ,0) ik 1)yees, 
H(t,N-1)), L(t F)=(L(t,0) ,E(t,1), +6, 
L(t,N-1)), D is the address of a, i It 
is easy to ee that j=D"-F satisfies the 
condition (*), So according to Theorem 1, 
Ky a has no routing conflict. 

Suppose $Y=(Y_0,Y_1,\ldots,Y_{s_1-1})$. The fanout
vector F(Y) is defined as
$(Y_0,Y_0,\ldots,Y_0,\;Y_1,Y_1,\ldots,Y_1,\;\ldots,\;Y_{s_1-1},\ldots,Y_{s_1-1})$.
The replicate vector R(Y) is defined as
$(Y_0,Y_1,\ldots,Y_{s_1-1},\;Y_0,Y_1,\ldots,Y_{s_1-1},\;\ldots,\;Y_0,Y_1,\ldots,Y_{s_1-1})$.
Both F(Y) and R(Y) have
$s_1^2=N$ elements.

Suppose Y is canonical. The F(Y) may
be obtained as follows:

(a) $[J<\sqrt N] \to \alpha$, $(\sqrt N-1)J \to \zeta$.

(b) $Y \to EA$, $\zeta \to ED$, $\alpha \to EE$. Perform n steps
of routing $+2^{n-1},+2^{n-2},\ldots,+2^0$.

(c) Perform t steps of broadcast-type
routing $+2^0,+2^1,\ldots,+2^{t-1}$. $EA \to B$. Then B=F(Y).

Suppose Y is canonical. R(Y) may be
obtained as follows:

(a) $[J<\sqrt N] \to \alpha$.

(b) $Y \to EA$, $\alpha \to EE$. Perform t steps of
broadcast-type routing $+2^t,+2^{t+1},\ldots,+2^{n-1}$.
$EA \to B$. Then B=R(Y).

Besides, irregular fanout and replicate
may also be implemented without conflict.
Due to the limitation of space,
these materials are omitted here.
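
For reference, the fanout and replicate vectors of a vector of $s_1$ elements can be stated directly as list operations. This is only a restatement of the definitions in this section, not of the routing steps used to obtain them on the network; the function names are hypothetical.

```python
def fanout(y):
    """F(Y): each element repeated s times in a row, s = len(y)."""
    s = len(y)
    return [v for v in y for _ in range(s)]

def replicate(y):
    """R(Y): the whole vector repeated s times, s = len(y)."""
    s = len(y)
    return [v for _ in range(s) for v in y]
```

Both results have $s_1^2$ elements, so with $s_1=\sqrt N$ they fill all N cells.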


Conclusion 


This paper has discussed in detail
the various properties of conflict-free
routing of the SP2I interconnection network;
they are the basis for finding implementation
methods of data manipulation functions.
The number of routing steps of the implementation
methods presented in this paper is
O(log N), and the time needed for computing
control information is usually less than
the routing time. The SP2I network needs
$O(N\log_2 N)$ gates. Thus, SP2I has several
advantages: network structure and controlling
are simple, transmission rate is
high, and hardware devices are limited to a
reasonable range. SP2I may efficiently
realize the main data alignment functions
faced by the CVCVHP system. The iteration
method of automatic vector-routing, which
is used especially for solving routing
conflict problems, enables SP2I to easily
realize various dynamic and parallel data
alignments. The extension of the iteration
method may practically and efficiently
solve the routing conflict problems in
ADM, Indirect Binary n-Cube, $\Omega$, $\delta$, Baseline,
and other networks.


Acknowledgements 


The authors gratefully acknowledge
the most helpful comments of the referees.
Ms. Zheng Ya-Xian's patient and skillful
typing is also sincerely appreciated.


References 


[1] Gao Qing-Shi, Zhang Xiang, "A General-Purpose
Cellular Supercomputer: Cellular Vector Computer
of Vertical and Horizontal Processing with Virtual
Common Memory", Chinese Journal of Computers,
Vol.2, No.1, January 1979, pp. 1-13.

[2] Zhang Xiang, Gao Qing-Shi, "Principle
of Automatic Vector-Routing and Its
Iteration", Chinese Journal of Computers,
Vol.4, No.6, November 1981, pp. 459-467.

[3] R. J. McMillen and H. J. Siegel, "MIMD
Machine Communication Using the Augmented
Data Manipulator Network", 7-th
Annual Symp. Computer Architecture, May
1980, pp. 51-58.

[4] Gao Qing-Shi, Zhang Xiang, "Another
Approach to Making Supercomputer by
Microprocessors: Cellular Vector Computer
of Vertical and Horizontal Processing
with Virtual Common Memory",
...ing, August 1980, pp. 163-164.

[5] M. C. Pease, "The Indirect Binary n-Cube
Microprocessor Array", IEEE Trans.
Comput., Vol. C-26, May 1977, pp. 458-473.

[6] D. H. Lawrie, "Access and Alignment of
Data in an Array Processor", IEEE
Trans. Comput., Vol. C-24, December 1975,
pp. 1145-1155.

[7] J. H. Patel, "Processor-Memory Interconnections
for Multiprocessors", 6-th
Int'l. Annual Symp. Computer Architecture,
April 1979, pp. 168-177.

[8] C. Wu and T. Feng, "Routing Techniques
for a Class of Multistage Interconnection ...",
pp. 197-205.

[9] C. Wu and T. Feng, "The Reverse-Exchange
Interconnection Network",
Int'l. Conf. on Parallel Processing,
August 1979, pp. 160-174.

[10] H. J. Siegel, "Analysis Techniques for
SIMD Machine Interconnection Networks
and the Effects of Processor Address
Masks", IEEE Trans. Comput., Vol. C-26,
February 1977, pp. 153-161.

[11] H. J. Siegel, "A Model of SIMD Machines
and a Comparison of Various Interconnection
Networks", IEEE Trans. Comput.,
Vol. C-28, December 1979, pp. 907-917.


DISTRIBUTED CIRCUIT SWITCHING STARNET*


Chuan-lin Wu, Woei Lin and Min-Chang Lin 


Department of Electrical Engineering 


Austin, Texas 


Abstract -- Starnet is a communication sub- 
net which can cost-effectively connect hundreds 
or thousands of processors for distributed 
processing. It uses distributed control and 
circuit switching. Starnet's communication 
medium includes two major components: a multi- 
stage interconnection network and a set of 
interface units. The interconnection network 
uses a destination routing scheme with no 
central control. The interface unit provides 
handshaking between the computer/data node and 
the interconnection network under the control of 
a microprocessor. Detailed design of the communication
medium is described. A model for
comparing cost-effectiveness among Starnet, 
crossbar and multiple buses is included. 


I. Introduction 


Starnet is a distributed circuit switching 
local communication subnet which can provide 
flexibility required to cost-effectively solve 
general distributed processing problems. The 
area of local computer network architecture is 
concerned with interconnecting two or more com- 
puters within a restricted area such as a 
single building or a small cluster of buildings, 
to facilitate high-performance distributed 
processing. Although there are many local com- 
puter networks proposed or currently existing 
[1], increasingly sophisticated technology and 
enlarged problem domain have spawned a need for 
investigating networks which can provide higher- 
speed information processing in a new environ- 
ment enhanced by technological advances and 
new processing requirements. 


It is our goal to design a reconfigurable 
subnet which allows partitioning connected 
resources into any connection topology(ies). 
According to our need of efficient distributed 
processing the communication subnet consists 
of an interconnection network and a set of 
interface units (IU's). A block diagram of the 
communication subnet is shown in Fig. 1. Shown 

in Fig. 1 is also a multiple-IU assignment to 
system components, called nodes. The subnet 
realizes protocols specified in the first three 
layers of a hierarchical network architecture 
model [2]. In the first layer, the subnet 
specifies the mechanical, electrical and func- 
tional characteristics required to connect, 
maintain, and disconnect a physical circuit in
the paths between interface units. The second 
layer breaks data up into frames, provides des- 
tination address, transmits the frames, processes 
the acknowledgement from the receivers and 
handles errors if they appear. The last layer of 
the subnet handles controls of the subnet oper- 
ation. Its key design issues among others are 
routing and flow control. The architecture of 
the subnet is to implement these functions 
specified in the three layers in terms of its 
components: interconnection network and inter- 
face units. 


Fig. 1 A block diagram of Starnet


In section II, the design issues of the 
interconnection network are considered in two 
aspects: network topology and routing. Section 
III deals with switching elements of the interconnection
network. The interface unit design
is then explored in section IV. An evaluation
of the bandwidth and cost-effectiveness is
presented in section V.


II. Network topology and routing 


A modified topology of baseline network [3] 
is used. The baseline network is a multistage 
interconnection network which can provide a full- 
connection. By full-connection, we mean that 
there exists a direct connection for each input/ 
output pair. Fig. 2 shows a generalized recur- 
sive process to generate baseline topology. In 
the recursive process, the first stage contains 
N/r switching elements of size r x t (i.e., r 
inputs and t outputs) where N is the number of 
inputs of the network. The second stage contains 
t subblocks: $C_0$, $C_1$, ..., and $C_{t-1}$. The process
can recursively be applied to the subblocks until 
each subblock can physically be realized by an 
off-the-shelf switching element. A topology 


generated by setting r = t = 2 and N = 8 is 
shown in Fig. 3. The topology shows that there 
are three stages of 2 x 2 switching elements and 
there are 8 inputs and 8 outputs. 


Fig. 2 A recursive process to generate 
baseline topology. 
Fig. 3 An 8x8 baseline network topology


The label of the components (switching 
elements and links) of the interconnection net- 
work can be illustrated by the baseline net- 
work shown in Fig. 3. The stages are labelled 
in a sequence from 0 to n - 1 with 0 for the
leftmost stage, where $n=\log_r N$. Similarly,
the levels of links are labelled in a sequence
from 0 to n. The switching element in stage i
is labelled by $(S_{n-1}S_{n-2}\ldots S_1)_i$ where $S_j$,
$1\le j\le n-1$, is a base-r number. The input/
output links of a switching element are labelled
by $S_{n-1}S_{n-2}\ldots S_1S_0$ where $S_{n-1}S_{n-2}\ldots S_1$
denotes the label of the switching element from
which the link spreads and $S_0$ denotes the relative
location of the link in one side of the
switching element. In Fig. 3, there are three
stages (0, 1, 2), four levels (0, 1, 2, 3), 
four switching elements (00, 01, 10, 11) in 
each stage and 8 links (000, 001, ...., 111) in 
each level. 
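
The counts quoted for Fig. 3 follow directly from the labelling rule. The sketch below enumerates the labels for a network of N inputs built from r x r switching elements; it assumes N is an exact power of r and uses single-character digits, and the function name `baseline_labels` is illustrative only.

```python
from itertools import product

def baseline_labels(n_inputs, r):
    """Enumerate stage, level, element and link labels of a baseline network
    built from r x r switching elements (assumes n_inputs is a power of r)."""
    n, size = 0, 1
    while size < n_inputs:
        size *= r
        n += 1
    assert size == n_inputs, "n_inputs must be a power of r"
    digits = [str(d) for d in range(r)]   # single-character digits, so r <= 10
    return {
        "stages": list(range(n)),                 # 0 .. n-1
        "levels": list(range(n + 1)),             # 0 .. n
        "elements": ["".join(p) for p in product(digits, repeat=n - 1)],
        "links": ["".join(p) for p in product(digits, repeat=n)],
    }
```

For N = 8 and r = 2 this yields three stages, four levels, the elements 00, 01, 10, 11 in each stage and the links 000 to 111 in each level.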


There is one and only one path existing
between a source (input) and a destination
(output). The unique path can be decided
according to the destination address. Suppose
that a source S is to be connected to a
destination D whose address can be represented
by a base-r number $D_{n-1}D_{n-2}\ldots D_1D_0$. To connect
the source to the destination, the switching
element to which the source is attached will
take $D_{n-1}$ and connect the source to the next stage
in terms of its $D_{n-1}$-th output link. The
switching element in the next stage will then
take the next number in the destination, $D_{n-2}$,
and use its $D_{n-2}$-th output link to connect the
source to the next stage. The process is repeated
until it reaches the destination. For
example, assume that source S is to be connected
to destination D, 101 in Fig. 3. Then switching
element $(01)_0$ will take the leftmost bit of D,
1, and switch the source to switching element
$(10)_1$ using the lower output link. Switching
element $(10)_1$ will then take the next bit of D,
0, and switch the source to switching element
$(10)_2$ using the upper output link. The last bit
of D, 1, will then be used by switching element
$(10)_2$ to connect the source to the destination.
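
The destination routing rule amounts to reading the destination address one base-r digit per stage, most significant digit first. A minimal sketch follows; the function name `output_links` is hypothetical, and for r = 2 digit 0 is taken as the upper output and digit 1 as the lower output, as in the example above.

```python
def output_links(dest, n, r=2):
    """Return the output link index used at each stage, first stage first.

    Stage i uses digit D_{n-1-i} of the base-r destination address.
    """
    digits = []
    for _ in range(n):
        digits.append(dest % r)   # peel off the least significant digit
        dest //= r
    return digits[::-1]           # most significant digit first

ports = output_links(0b101, n=3)
names = ["upper" if d == 0 else "lower" for d in ports]
print(ports, names)
```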


For the purpose of fault diagnosis and sub- 
net reconfiguration, it is necessary to know the 
routing path. The routing path can be derived 
from source and destination addresses. Let 
$S=S_{n-1}S_{n-2}\ldots S_1S_0$ and $D=D_{n-1}D_{n-2}\ldots D_1D_0$ as
used in the above paragraph. Then the switching
elements on the path from the source S to the
destination D can be denoted by
$(D_{n-1}D_{n-2}\ldots D_{n-i}S_{n-1}S_{n-2}\ldots S_{i+1})_i$ for $0\le i\le n-1$. In
other words, the links on the path can be denoted
by $(D_{n-1}D_{n-2}\ldots D_{n-i}S_{n-1}S_{n-2}\ldots S_{i+1}D_{n-i-1})_{i+1}$
for $0\le i\le n-1$. In the routing example
shown in Fig. 4, S = 011 and D = 101 and
switching elements $(01)_0$, $(10)_1$ and $(10)_2$ are on
the path.
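
The two label formulas can be evaluated mechanically. The sketch below takes the source and destination as digit strings written most significant digit first; the function name `routing_path` is illustrative only.

```python
def routing_path(src, dst, n):
    """Labels of switching elements and links on the path from src to dst.

    src, dst : digit strings of length n, most significant digit first,
               i.e. src = S_{n-1}...S_0 and dst = D_{n-1}...D_0.
    Returns (elements, links) as lists of (label, stage) and (label, level).
    """
    elements, links = [], []
    for i in range(n):
        # D_{n-1}...D_{n-i} followed by S_{n-1}...S_{i+1}
        element = dst[:i] + src[:n - 1 - i]
        elements.append((element, i))
        # the link leaving that element is selected by digit D_{n-i-1}
        links.append((element + dst[i], i + 1))
    return elements, links
```

For S = 011 and D = 101 this gives the elements $(01)_0$, $(10)_1$, $(10)_2$ named in the text, and the last link label equals the destination address.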


Since there is one and only one path existing
for an I/O pair, the I/O pair cannot be
connected if there is a faulty element
in the path or a conflict in the use of the switching
elements and/or links occurs. In order to provide
fault tolerance and higher availability,
an extra stage is added to the baseline network.
A modified baseline topology is shown in Fig. 4.
The extra stage together with the baseline network
will allow two connection paths for an I/O pair
as shown in Fig. 4. In general, if the switching
element used in the extra stage has t outputs,
an I/O pair can have t connection paths.


Fig. 4 A modified 8x8 baseline network (an extra stage followed by the 8x8 baseline network).


The routing scheme is the same as the one des- 
cribed above except for the portion of the extra 
stage. Since any output link of the switching 
element to which the source is attached in the 
extra stage will lead to the destination, the 
source can select one of the output links ac- 
cording to the priority policy and/or the state 
of the subnet. After an output link of the extra 
stage is selected, the aforementioned routing can 
then be followed to establish a path to the 
destination. 


III. Design of fault tolerant switching element 


Interconnection network is designed in a 
way that it can be constructed modularly in terms 
of a single type of switching element. The
switching element realizes communication proto- 
cols which specify control strategy and switching 
methodology [4]. In addition to the protocols, 
fault tolerance is justified by the fact that 
circuit complexity of the subnet can be at the 
same level as the complexity of the other part of 
the system. It is likely fair to say that a re- 
liable subnet is even more critical than other 
reliability issues. Here we describe a 2 x 2 
fault tolerant switching element which is to be 
used to modularly construct the interconnection 
network. The switching element uses distributed
control and circuit switching. Its pin and gate 
count stays in the implementable range of VLSI 
technology. 


A block diagram of the switching elements is 
shown in Fig. 5. The switching element comprises
two major parts: control plane and data
plane, which deal with control and data respect- 
ively. The control plane does the handshaking 
process in establishing connection paths. It 
generates control signals which are fed into the 
data plane to connect data ports. The data plane 
is the part where the data communication actually 
occurs. Depending on the word length required, 
the number of the data planes to be connected can 
accordingly be adjusted. 


Fig. 5 A block diagram of a 2x2 switching
element
The major function of the control plane is
to set up the path between the source and the
destination according to the routing scheme. As
shown in Fig. 5, the control plane has the following
input control lines: address tags of destinations
(tag1, tag2), request lines (req1,
req2), acknowledge signals (ack1', ack2'), release
signals (rel1, rel2), direction indicators (dir1,
dir2), strobe lines (stb1, stb2), priority signal
(prio) and reset signal (reset). The output control
lines are relayed request lines (req1',
req2'), acknowledge signals (ack1, ack2), direction
indicators (dir1', dir2'), and strobe
lines (stb1', stb2'). The control circuit receives
input signals from two switching elements
in the previous stage, generates control signals
for the data plane, modifies the input signals
and relays them to the next stage. The control
plane has four internal registers to record the
current connection status of the switching element.
It generates internal signals to facilitate
conflict resolution and data plane control.
With this design, a physical path of the modified
baseline network can be established in one clock
period, which includes two clock phases $\phi_1$ and $\phi_2$.
During phase $\phi_1$, depending on the routing tag and
the current state of the switching element, the
request will be rippled down stage by stage.


If no conflict occurs along the requested
path, an acknowledge signal will be returned from
the receiver. During clock phase $\phi_2$, each
switching element along this allocated conflict-free
path updates its internal registers
and sets up the related connection in the data
plane. At the end of phase $\phi_2$, a physical path
is actually established between the source and
the destination if there is no conflict. The established
path can remain in place as long as the
source wishes. The source can issue the release
signal to disconnect the path. Until a better
fault tolerance scheme is finalized, Triplicated
Modular Redundancy, in which the circuit is
triplicated, is used to enhance the reliability.


The data plane is composed of a data circuit
and a test circuit. The data circuit comprises
a number of duplicated copies, each of which
can have a general 2 x 2 crossbar connection for
full-duplex communication. The test circuit always
monitors the response to on-line data and
performs self-diagnosis asynchronously. If a
fault occurs, the test circuit will adopt a recovery
procedure to reconfigure the data circuit.
The test circuit comprises three parts: test
data generator, match detector, and recovery control
logic. The test data generator provides the
idle link with auxiliary test data in addition to
on-line data for fault detection. Match detectors
compare input data (either on-line data or
auxiliary test data) to the associated outputs.
If there exists a mismatch, the corresponding
error flag will be raised, which will then trigger
the recovery control logic to replace the current
copy of the data circuit with an error-free copy.


The switching element can perform bidirec- 
tional fault-tolerant communication and broad- 


casting. The data propagation delay per switch- 
ing element is estimated to be 30 ns. 88 pins 
are required for a switching element which allows 
16-bit parallel half-duplex communication. The 
design is viable for VLSI implementation. The 
gate-level design details can be found in [5]. 
IV. Interface unit 

The system component is attached to the in- 
terface unit which in turn connects to multiple 
ports of the interconnection network. The func- 
tions of the interface unit can be divided into 
two parts. Part 1 provides mechanisms for en- 
abling communication between the system component 
and the interface unit. Its design depends on 
the system component. Part 2 facilitates access- 
es to the interconnection network. The design of 
the second part concerns the interaction to the 
interconnection network and is independent of the 
attached system component. 


Part 2 contains active and passive connection 
ports. The active port is connected to the input 
of the interconnection network while the passive 
port is connected to the output. Only the active 
port can initiate connection request. However, 
both active and passive ports can transmit data 
after a connection path is established. When 
an active port needs to access a passive port,
it places the address bits of the destination
port on the proper data lines and raises the request
signal. Through the routing procedure (in a network
clock cycle), the request is either accepted
or rejected as signaled by the acknowledge line.
If rejected, the unit tries again or tries the alternate path. If
accepted, a path is already established and the
interface unit then starts the data communication 
using the handshaking process and performs error 
checking and possible error correction. Input 
and output buffers are provided for each port. 
The main function of this part is to provide a 
reliable data communication. The higher level 
protocols are implemented in the first part of 
the interface unit. 
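
The request, acknowledge and retry behaviour of the active port can be sketched as a small loop. This is only an illustration: the callable `request`, the retry limit `max_rounds` and the order in which alternate paths are tried are assumptions, since the text does not specify a retry policy.

```python
def establish_connection(request, n_paths, max_rounds=3):
    """Try to set up a circuit, falling back to alternate paths.

    request(path) -> bool models one network clock cycle: True when the
    acknowledge line signals acceptance, False when it signals rejection.
    n_paths is the number of alternate paths offered by the extra stage.
    Returns the index of the accepted path, or None if every attempt failed.
    max_rounds is a hypothetical retry limit, not taken from the paper.
    """
    for _ in range(max_rounds):
        for path in range(n_paths):
            if request(path):
                return path       # path established; data transfer may start
    return None
```

For instance, with two alternate paths of which the first is busy, the call returns path 1 after one rejected request.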


As shown in Fig. 6 for a connection between 
a master computer node and a slave computer node, 
the input/output lines of the IU can be divided 
into two groups: 


(1) IU to interconnection network - This side is
    directly connected to the interconnection
    network. An interface unit has two kinds of
    connection to the interconnection network:
    the upper links representing the active port
    where the master node initiates the request
    to the slave node; the lower links representing
    the passive port where the slave node
    sends the reply to the master node.

(2) Computer nodes to IU - The IU can be treated
    as an I/O device of computer nodes. A node
    will direct the IU by passing orders via hardwired
    interrupt lines and control codes contained
    on the data bus. An IU requests nodes
    for service in a similar fashion.
Fig. 6 Connection of interface units


In this section we present an interface unit
design, which is based on the 2900 family [6].
The configuration of the bit-slice based IU design
is illustrated in Fig. 7. The 2900 family
components employed in this design include (1)
CPU-ALU and scratchpad register units, Am2901;
(2) microprogram sequencer and controller, Am2910;
(3) bipolar memory, Am2960 Series; (4) interrupt
controller and support devices, Am2914; and (5)
condition code multiplexer, Am2922.


The IU internal structure as illustrated in 
Fig. 7 is elaborated as follows. 
(1) ALU (Am2901): Using 9 bits of the microword,
the ALU is capable of selecting source operands,
functions, and destination registers, and of
providing various status outputs.
(2) Microprogram Controller (Am2910): This
microprogram controller is an address sequencer
intended for controlling the sequence of
microinstructions stored in the microprogram
memory. Besides sequential access, it provides
conditional branching and five levels of
microroutine nesting.
(3) Pipeline Register: The pipeline register is
used to improve the execution speed. It is added
at the PROM output to allow the ALU operation to
overlap with the memory fetch process.
(4) Interrupt Controller (Am2914): The Am2914 in- 
terrupt controller may be connected to pro- 
vide the capability of microprogram level 
interrupt. The occurrence of an interrupt 
causes a branch address, which is provided 
by mapping PROM, to be fed into microprogram 
controller. Such a vectored interrupt sus- 
pends the current routine and activates a 
specific interrupt service microroutine. 
After the interrupt service routine is fin- 
ished, the suspended routine will be resumed. 
(5) Microprogram Memory (Am2960): These PROM's
are used to store a number of interrupt-driven
microroutines, which handle various
stimuli from outside.
(6) Interrupt Mapping PROM (Am29751 and Am2913): 
These PROM's supply the 12 bit starting ad- 
dress of a specific microroutine according 
to the type of interrupt. 
(7) Condition Code multiplexer (Am2922): This 
multiplexer selects the desired status and 
feeds it into microprogram controller. 
(8) Control Register (CR): This 16-bit register 
is used to control data flow inside IU. 
(9) Output Data Register (ODR): This 16-bit
register holds the data to be exported to the
active port and/or passive port.
(10) Input Data Register (IDR): This 16-bit reg- 
ister holds the data imported from active 
port or passive port. 
(11) Control Signal Generator (CSG): The CSG, com- 
posed of an 8-bit register and an 8-bit 
driver, drives handshaking signals, strobe 
signals, and request signals which are gen- 
erated from pipeline register. 
(12) Active Port Interface and Passive Port
Interface (API and PPI): These two transceivers
interface the internal output data bus to
the external data buses $DA_{0-15}$ and $DA'_{0-15}$,
respectively.
(13) Data Bus Driver (DBD): This transceiver
interfaces the internal input data bus to the
external data bus $D_{0-15}$, which is connected to
the host.
(14) Bypass Switch (BS): This switch is used to
bypass the 16-bit data transferred to the IU
without microprogram intervention.
(15) Strobe Switch (SS): As shown in Figure 4-7,
this switch is used to bypass the strobe signals
transferred to the interface unit without
microprogram intervention.


In the microprogram storage, there are a 
number of microroutines, which are invoked by 
the interrupt from outside. Each microroutine 
consists of a series of microinstructions that 
contain the control information for particular
hardware elements. Every time an interrupt occurs,
a specific microroutine is activated and
executed by gating successive microinstructions
into the pipeline register. Via dedicated
connections to the corresponding hardware components,
these microinstructions initiate a sequence of
hardware activities that enable the interface unit
to react properly. Corresponding to
the hardware components of the interface unit,
the microinstruction is divided into 11 different
fields. Each field is in charge of a
specific hardware element. The width, 64 bits,
is suitable for an Am2900 family based design.


Fig. 7 Configuration of interface unit

V. Performance Evaluation 


In this section, two performance factors,
bandwidth and cost-effectiveness (bandwidth/
cost), are examined to compare the bus-structure
network [7], Starnet, and the crossbar switch.
Both asynchronous and synchronous modes of
operation are considered. The following
assumptions are made for these two modes of
operation.
Asynchronous mode: 

(1) Poisson arrivals and exponentially distributed
service times for messages are assumed;
each input link has the same arrival rate
$\lambda$, and the average message transmission
time is $1/\mu$.

(2) The time to set up a path is assumed to be
small compared to the message transmission time
and can be neglected.

(3) Unbuffered systems are assumed; a request
is discarded if it is blocked or if it
arrives at a busy input link.


Synchronous mode: 


(1) Message length is fixed; the time to transmit
a message plus the time to set up a
path is defined as a cycle.

(2) Each input link generates requests for message
transmission randomly and independently;
the destinations of requests are uniformly
distributed over all output links.

(3) The requests are generated synchronously; at
the beginning of each cycle, each input link
generates a request with the same probability
$P_0$.

(4) Blocked requests are discarded;
the requests generated in the next cycle are
assumed to be independent of the previous
ones.


1. Bus-structure network analysis 


A network with $\log_2 N$ buses is used for
comparison, where N is the number of computer
nodes to be connected. Since the complexity
of a unibus system can be approximated as of
order N, the complexity of a $\log_2 N$-bus system
is about $O(N \log_2 N)$. For simplicity, we assume
that under both asynchronous and synchronous modes
of operation, the $\log_2 N$-bus system is able to
achieve its ideal condition, in which its
bandwidth is


$m$, when the number of requests in the system
at a particular time is $m$ and $m \le \log_2 N$;
$\log_2 N$, when $m > \log_2 N$,


where $0 \le m \le N$.


2. Analysis for baseline network and crossbar switch


A crossbar switch can be thought of as a special
case of a baseline network with one stage and an
NxN switching element. The analysis of the baseline
network can thus be applied to the crossbar switch
with slight modification.


Asynchronous operation: 


The asynchronous operation of the unbuffered
baseline network is assumed to have continuous-time
Markovian behavior, which means that the
transitions between system states occur in
continuous time and the rate of transition to the
next state depends on the current state only.


Fig. 8 shows the Markov chain of this model;
the number of currently accepted requests is
chosen as the state parameter. A new arrival of
an acceptable request changes state $i$ to state
$i+1$ with rate $(N-i) \cdot \lambda \cdot PP_N(i)$,
$i=0,1,\ldots,N-1$; a departure of a currently
accepted request changes state $i$ to state $i-1$
with rate $i \cdot \mu$, $i=1,2,\ldots,N$,
where $PP_N(i)$ is the probability that a new
request will be accepted when a size-N network
currently has $i$ accepted requests ($i$ paths
currently exist). If all $PP_{N/2}(i)$'s are known,
$PP_N(i)$ can be found from Eq. (1), where $R_{N,m}$
is the total number of possible arrangements of
$m$ paths in a size-N network.

Fig. 8 Markov chain for asynchronous baseline
operational model


To find the $PP_N(i)$'s, we can start from the
boundary conditions $PP_2(0)=1.0$, $PP_2(1)=0.5$,
$PP_2(2)=0.0$. Once the $PP_N(i)$'s are found, the
equilibrium probabilities, $ST(i)$'s, can be solved,
and the bandwidth of the system is
$\mu \sum_{i=1}^{N} i \cdot ST(i)$.


The crossbar switch in asynchronous mode is
analyzed with the same model. The only change is
that $PP_N(i)=(N-i)/N$ for $i=0,1,\ldots,N-1$.
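
The crossbar case of the asynchronous model can be sketched numerically, since its acceptance probability is given in closed form. The following is a minimal illustration, not the authors' program: the function name `crossbar_async_bandwidth` and its parameters are assumptions, and the equilibrium probabilities are obtained from the standard birth-death balance relation applied to the transition rates stated above.

```python
def crossbar_async_bandwidth(n, lam, mu):
    """Return (ST, bandwidth) for an unbuffered n x n crossbar, asynchronous mode.

    State i is the number of currently accepted requests.
    Arrival rate out of state i:    (n - i) * lam * PP(i), with PP(i) = (n - i) / n.
    Departure rate out of state i:  i * mu.
    """
    # Unnormalised equilibrium weights from the balance relation
    # weight[i + 1] * departure(i + 1) = weight[i] * arrival(i).
    weights = [1.0]
    for i in range(n):
        arrival = (n - i) * lam * (n - i) / n
        departure = (i + 1) * mu
        weights.append(weights[-1] * arrival / departure)
    total = sum(weights)
    st = [w / total for w in weights]          # equilibrium probabilities ST(i)
    bandwidth = mu * sum(i * prob for i, prob in enumerate(st))
    return st, bandwidth


if __name__ == "__main__":
    # lam / mu = 1.0, the setting quoted for the asynchronous curves.
    for size in (4, 8, 16):
        _, bw = crossbar_async_bandwidth(size, lam=1.0, mu=1.0)
        print(size, round(bw, 4))
```

The baseline network would use the same routine with its own $PP_N(i)$ values, which depend on Eq. (1) and are therefore not reproduced here.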


Synchronous operation: 


The synchronous operation of the baseline network
and the crossbar switch is modeled by assuming
that each input link generates a request with
probability $P_0$ in every cycle. The probability
$PA(i)$ that $i$ requests are accepted in one cycle
can be found in Eq. (2).

The first summation term is the probability
that $m$ requests are generated in one cycle; the
second summation term is the conditional
probability that $i$ requests are accepted when $m$
requests are generated in one cycle. Note that
$PP_N(0)$ is equal to 1 for every N.


The bandwidth of the system is
$\sum_{i=1}^{N} i \cdot PA(i)$; for the crossbar
switch this has the simplified form
$N \cdot [1-(1-P_0/N)^N]$.
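
The simplified crossbar form can be checked against a direct enumeration of the synchronous assumptions. This is a small sketch only: the function names are hypothetical, and it assumes that a crossbar accepts exactly one request per distinct requested output link.

```python
from itertools import product


def crossbar_sync_bandwidth(n, p0):
    """Closed form N * [1 - (1 - P0/N)^N] for the synchronous crossbar."""
    return n * (1.0 - (1.0 - p0 / n) ** n)


def crossbar_sync_bandwidth_enumerated(n, p0):
    """Expected number of accepted requests per cycle, by exhaustive enumeration.

    Each input link is idle with probability 1 - p0, or requests one of the
    n output links with probability p0 / n each. Only practical for small n.
    """
    choices = [None] + list(range(n))          # None means "no request"
    expected = 0.0
    for combo in product(choices, repeat=n):
        prob = 1.0
        for c in combo:
            prob *= (1.0 - p0) if c is None else p0 / n
        accepted = len({c for c in combo if c is not None})
        expected += prob * accepted
    return expected


if __name__ == "__main__":
    for size in (2, 3, 4):
        print(size,
              crossbar_sync_bandwidth(size, 1.0),
              crossbar_sync_bandwidth_enumerated(size, 1.0))
```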


3. Comparison 
Bandwidth: 


The bandwidths of the three networks are
shown in Fig. 9. The analysis of the baseline
network and the crossbar switch is verified by
simulation for N=4,8,16. We can see from the figure
that the bus network is not suitable when N is
large, since its bandwidth is limited by the number
of buses. Increasing the number of buses is
not practical, since each additional bus not
only increases the cost by order N but also
incurs more scheduling problems. The bandwidths of
both baseline and crossbar networks are of order
N; the bandwidth of the baseline network falls off
slowly as N increases, while the bandwidth of the
crossbar switch reaches a lower limit as N
increases to infinity.

Fig. 9(a) Bandwidth of size-N networks, $P_0=1.0$
for synchronous mode, $\lambda/\mu=1.0$ for
asynchronous mode.


Cost effectiveness: 


To compare the cost-effectiveness, we assume
that the control mechanisms of all three networks
are implemented in hardware. The number of
crosspoints of each network system is chosen as
the cost index: the bus system has
$N \cdot \log_2 N$ crosspoints, the baseline network
has $2 \cdot N \cdot \log_2 N$ crosspoints, and the
crossbar switch has $N \cdot N$ crosspoints.
This choice of cost index favors the bus and
crossbar networks, since the control logic for each
crosspoint of these two networks is more complex
than that of the baseline network. The result is
shown in Fig. 10. It can be seen that the baseline
network is the most cost-effective when N>64. The
bus structure network is not able to support a
large system for two reasons: little bandwidth is
available and it is not cost-effective. The
crossbar has poor cost-effectiveness when N is
large; in addition, its tremendous complexity makes
it very difficult to implement a crossbar switch
with size over one hundred.

Fig. 10 Cost-effectiveness of size-N networks,
$P_0=1.0$ for synchronous mode, $\lambda/\mu=1.0$
for asynchronous mode.

VI. Conclusion 
This design shows that Starnet can provide
a data access time of 1 microsecond in a local
computer network with over one thousand computer/
data nodes. This transport mechanism can
thus provide better coupling among its nodes,
compared to some contemporary multiprocessing
systems. With its enhanced reliability and
reconfigurability, Starnet has potential for use
in various applications, including large
scale real-time computation and office information
systems.

Abstract


This paper considers various levels of 
parallelism obtainable from sequential solutions 
for locating the eigenvalues of real symmetric 
tridiagonal matrices based on the bisection 
algorithm coupled with Sturm sequence evaluation. 
Three levels of parallelism are identified and the 
implementation of these three levels on three 
different parallel computer architectures is 
described. The three computer systems are a vector 
processor (the CRAY-1), an array processor (the 
ICL Distributed Array Processor) and an asynchron- 
ous multiprocessor consisting of 4 minicomputers 
linked through shared memory. Results presented
confirm the theoretical analysis and show that one
of the levels of parallelism, based on converting
a standard linear recurrence relation, is of use
only for locating small numbers of eigenvalues
using large numbers of processors. The other two
levels, when combined, yield an effective 
algorithm for locating any number of eigenvalues 
on all three types of computer. 


0. Introduction

The application considered here is the 
determination of the eigenvalues of a real 
symmetric tridiagonal matrix. 


This is an extremely important problem as 
standard sequential eigenvalue solvers first trans- 
form a real symmetric matrix to tridiagonal form 
by similarity transformations: the eigenvalues of 
the resultant tridiagonal matrix are identical to 
those of the original matrix. 


The original problem of N eigenvalues on an
interval R yields, on evaluation of the associated
Sturm sequences at m interior points of R, up to
m+1 similar problems on smaller intervals.
Repeated application of the technique isolates the 
eigenvalues onto smaller and smaller intervals 
until eventually the user required minimum size is 
reached. 


Parallelism can be introduced into the problem
at three levels. Firstly, since multiple
independent subintervals are generated, parallelism
over interval processing can be utilised. This
solution has been implemented by Barlow and Evans
(1978) but is inefficient when the number of
intervals is small. Secondly, parallelism can be
exploited within the interval by sampling in
parallel a number of points in the same interval.
Barlow et al (1981a) have reported on the
implementation of a mixture of these two methods
on two different parallel computers. Finally,
parallelism can be introduced within the Sturm
sequence evaluation.


Thus, the Sturm sequence is defined by a linear 
recurrence relation and well known methods (see 
for example Kuck, 1978) are available to trans- 
form this apparently sequential relation into 
parallel form. 


Which level of parallelism is the most 
efficient to exploit depends upon the balance 
between the demand for various parallel resources 
from the different versions of the algorithm, and 
the availability and cost of these resources on a 
given computer system. To analyse this balance 
the paper first specifies the problem and then 
analyses the parallel properties of the three 
potential solutions. Sections 2,3 and 4 then 
report on the implementation of these solutions 
on three different types of parallel computer: an 
array processor (the ICL Distributed Array 
Processor), a vector processor (the CRAY-1) and 
finally an asynchronous multiprocessor based on 
four Texas 990/10 minicomputers linked via shared 
memory. 


1. Problem Specification and Analysis 

The solutions are all based on the classic 
method (Barth et al, 1967) of counting the 
negative signs of Sturm sequences derived from the 
matrix. Thus, given a symmetric tridiagonal
matrix of size n, with real eigenvalues lying
between $\lambda_{min}$ and $\lambda_{max}$,
counting the number of negative signs of the Sturm
sequence at a point $\lambda_0$ gives the number of
eigenvalues lying below $\lambda_0$.
Since the interval can be sampled at an arbitrary
number (m) of interior points, the interval
fragments itself into up to m+1 smaller intervals
containing one or more eigenvalues. Application
of the method to the new set of smaller intervals
isolates the eigenvalues further and the process
is repeated until the interval size is less than
some user specified size.
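
The sequential form of this method (one sample per interval, at the bisection point) can be sketched as follows. This is an illustrative sketch rather than the published procedure of Barth et al: the function names are hypothetical, the sign count uses the ratio recurrence given as equation (1.1) later in this section, and the guard that replaces an exactly zero term by a tiny positive value is an assumption made here to avoid division by zero.

```python
def sturm_count(diag, offdiag_sq, x, tiny=1e-30):
    """Number of eigenvalues of a symmetric tridiagonal matrix lying below x.

    diag        diagonal elements a_1 .. a_n
    offdiag_sq  squares of the n - 1 off-diagonal elements
    """
    count = 0
    p = 1.0                                    # p_0 = 1
    for i, a in enumerate(diag):
        b = offdiag_sq[i - 1] if i > 0 else 0.0
        p = (a - x) - b / p                    # p_i = c_i - b_i / p_{i-1}
        if p < 0.0:
            count += 1
        elif p == 0.0:
            p = tiny                           # assumed guard against dividing by zero
    return count


def kth_eigenvalue(diag, offdiag_sq, k, lo, hi, tol=1e-12):
    """k-th smallest eigenvalue (k = 1 .. n), assuming it lies in (lo, hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sturm_count(diag, offdiag_sq, mid) >= k:
            hi = mid                           # at least k eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # 2 x 2 matrix [[2, 1], [1, 2]], whose eigenvalues are 1 and 3.
    print(kth_eigenvalue([2.0, 2.0], [1.0], 1, 0.0, 4.0))
    print(kth_eigenvalue([2.0, 2.0], [1.0], 2, 0.0, 4.0))
```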


Sequential solutions sample each interval at
only one interior point (the bisection point).
Parallelism can be introduced at the three levels
mentioned in the introduction:

a) Processing some or all of the current set
   of known intervals in parallel: each
   individual interval is processed as in
   the sequential solution. This level of
   parallelism requires the same number of
   samples as the sequential solution but
   has the deficiency of leaving processors
   idle in the initial iterations of
   the algorithm, when the number of
   intervals is small. The maximum degree of
   parallelism is limited to the number of
   distinct eigenvalues (N) to be located
   and thus, assuming no overheads associated
   with controlling parallelism, the
   solution has a potential speedup (S) in
   the range $1 \le S \le N$.


b) Evaluation of many sample points from one
   interval in parallel. This solution can
   exploit an arbitrary number of processors
   but is relatively inefficient. Consider
   an interval containing eigenvalues that,
   on being sampled at m (equally spaced)
   interior points, fragments itself into
   only one interval that contains all the
   eigenvalues. Thus, multisection has
   reduced the interval size to 1/(m+1) of
   its original size, whereas sequential
   bisection can achieve this reduction
   using only $\log_2(m+1)$ samples. It
   follows that the potential speedup of this
   solution lies between $\log_2(m+1)$ and m:
   the latter reflects a fragmentation of a
   single interval into m+1 intervals
   containing eigenvalues.


c) Parallelism within the evaluation of the
   Sturm sequence for a single point. The
   sequential nonlinear recurrence relation
   is

   $$p_i = c_i - b_i/p_{i-1}, \quad p_0 = 1, \quad p_1 = c_1, \quad \text{for } i = 2,3,\ldots,n, \qquad (1.1)$$

   where $c_i = a_i - x$, with x being the
   sample point and $a_i$ the $i$th diagonal
   element of the tridiagonal matrix; $b_i$
   is the square of the corresponding
   off-diagonal element of the tridiagonal
   matrix. The number of negative signs of
   the $p_i$ yields the number of eigenvalues
   below x. Thus evaluation of this
   recurrence relation requires 3n
   operations.

   This relationship can be transformed into
   the parallel recursion relation

   $$q_0 = 1, \quad q_1 = c_1, \qquad \begin{bmatrix} q_i \\ q_{i-1} \end{bmatrix} = S_i S_{i-1} \cdots S_1 \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad (1.2)$$

   where

   $$S_j = \begin{bmatrix} c_j & -b_j \\ 1 & 0 \end{bmatrix}, \quad j = 1,2,\ldots,n,$$

   and where $p_i = q_i/q_{i-1}$.

   This relation involves $\log_2(n)$
   sequential stages, each stage consisting of
   between n/2 and n parallel subprocesses,
   each of complexity 12 operations: thus
   stage 1 forms all products $(S_j S_{j-1})$
   for j=2,...,n, stage 2 combines these
   results to give all products
   $(S_j \cdots S_{j-3})$ for j=4,...,n, etc.
   (Lambiotti, 1975). Since each $p_i$ can be
   reconstructed by a single operation, the
   number of eigenvalues below x can be
   computed in $12\log_2 n + 2$ parallel
   operations using n processors. Thus the
   speedup of the parallel version over its
   sequential counterpart is
   $\sim n/(4\log_2 n)$.
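
The recursive doubling idea of the third solution can be illustrated by simulating its stages sequentially. The sketch below is an assumption-laden illustration, not the implementation reported in the paper: the function names are hypothetical, the matrices are taken as $[[c_j, -b_j], [1, 0]]$ with the first off-diagonal term set to 0, an exactly zero denominator is treated as giving a negative term, and no scaling is applied, so the products may overflow for large matrices.

```python
def matmul2(x, y):
    """Product of two 2x2 matrices stored as (a, b, c, d) = [[a, b], [c, d]].

    Costs 8 multiplications and 4 additions, i.e. 12 operations.
    """
    a, b, c, d = x
    e, f, g, h = y
    return (a * e + b * g, a * f + b * h, c * e + d * g, c * f + d * h)


def sturm_count_sequential(diag, offdiag_sq, x):
    """Sign count from the ratio recurrence p_i = c_i - b_i / p_{i-1}."""
    count, p = 0, 1.0
    for i, a in enumerate(diag):
        b = offdiag_sq[i - 1] if i > 0 else 0.0
        p = (a - x) - b / p                    # no zero guard: illustration only
        if p < 0.0:
            count += 1
    return count


def sturm_count_recursive_doubling(diag, offdiag_sq, x):
    """Same count from prefix products of 2x2 matrices, built in log2(n) stages."""
    n = len(diag)
    prods = []
    for j in range(n):
        b = offdiag_sq[j - 1] if j > 0 else 0.0      # first b taken as 0 (assumption)
        prods.append((diag[j] - x, -b, 1.0, 0.0))
    shift = 1
    while shift < n:
        # All products within one stage are independent of each other, so on
        # a parallel machine each would be formed by a separate processor.
        nxt = list(prods)
        for j in range(shift, n):
            nxt[j] = matmul2(prods[j], prods[j - shift])
        prods = nxt
        shift *= 2
    count = 0
    for q_i, _, q_prev, _ in prods:            # first column holds (q_i, q_{i-1})
        if q_prev == 0.0 or q_i / q_prev < 0.0:
            count += 1
    return count


if __name__ == "__main__":
    diag, off = [2.0] * 5, [1.0] * 4
    for x in (0.5, 1.5, 2.5):
        print(x, sturm_count_sequential(diag, off, x),
              sturm_count_recursive_doubling(diag, off, x))
```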


Already from this simple examination of the 
schemes it can be seen that for the separation of 
large numbers of eigenvalues (N) the first 
solution offers the best potential. For small 
numbers of eigenvalues this solution has little 
parallel potential and the other two solutions are 
better. Since the first two solutions both apply 
parallelism over the evaluation of Sturm 
sequences of different sample points it is a 
relatively simple task to combine these two 
solutions to yield a single parallel solution with 
a broader range of application than either of its 
two parts. 


A further distinguishing feature between the 
solutions is the amount of synchronisation that 
the solutions require. Synchronisation is 
required in the third solution to ensure that all 
subprocesses of a stage are complete before the 
next stage starts, that is every 12 operations. 
In solutions one and two synchronisation is 
required at most every 3n operations; that is not 
more frequently than once for each Sturm sequence 
evaluation. 


Finally, there is the question of data 
communication. Thus for the first two solutions 
sample points must be provided to the processors 
and the results in some way compared. For the 
third solution equation 1.2 shows that between
each stage the $S_j$ must be moved between the
processors before the next stage can begin.


The effects of synchronisation and data 
communication will be more fully discussed in the 
following section. 


2. Implementation on an Array Processor 

It is assumed that the array processor
consists of a single control processor with p 
slave processors all of which execute the same 
instruction on different data. Any required set 
of slaves can be set inactive (masked out) on any 
instruction. The control processor can broadcast 
the same data to all of the slaves or pick out an 
item of data from one of the slaves. Slaves are 
assumed to be linearly interconnected so that all 
slaves in parallel can move an item of data to 
either their left or right-hand neighbour. 
Synchronisation is automatic on these systems. 
The ICL Distributed Array Processor (Flanders et 
al, 1977) on which the solutions were implemented 
has 4096 slave processors (each of which can 
process one bit at a time). 


Solution three, based on recursive doubling,
can use at most n processors and thus for n<p
some processors in the array are idle. Parallelism
here involves the evaluation of the $q_i$ of
equation 1.2. This is done by evaluating different
partial products of the $S_j$ at different
processors. Thus $a_j$ and $b_j$ are stored at
processor j; then all $S_j$ can be evaluated
locally once the sample point has been broadcast.
Each stage i of the $\log_2(n)$ stages involves
shifting the previous partial products (of S)
$2^{i-1}$ places along the array, followed by the
combination of the shifted results with the
previous results. The first i-1 of the
shifted results are filled in with the old results.
Although the arithmetic operation count is reduced
significantly from 3n for the sequential scheme to
$12\log_2(n)$ for this parallel form, the number of
shift operations, at (n-1), is linear in the system
size. While data communication paths other than
nearest neighbour are generally available on array
processors it is clear that the cost of moving
results will tend to dominate the processing for
large n.


While solution two can use any number of 
processors it is more efficient to combine it with 
solution one so that the maximum speedup potential 
is realised while at the same time utilising all 
the processors. While this may require some 
increase in data communication to allocate sample 
points to processors the results show that this 
overhead is small compared to the cost of evaluat- 
ing the Sturm sequence. 


Each iteration of the combined method consists 
of allocating a distinct sample point to each of 
the slave processors followed by the independent 
evaluation of the sample function at each of the 
processors. 


Allocation of the new sample points starts by 
detecting the set of non-empty intervals arising 
from the previous iteration. This is done by 
nearest neighbour comparison of the sample point 
results for points interior to an interval. For 
points adjacent to interval boundaries this is done 
by comparing the sample point result with the 
boundary point result carried over from the 
previous iteration. The full treatment of 
boundaries is described by Barlow et al, 1981, and 
it is sufficient here to note that new intervals 
adjacent to boundaries can be treated in exactly 
the same manner as new internal intervals. At this 
stage intervals containing eigenvalues but lying 
outside a user defined range of interest can also 
be marked as empty. Using parallel add and shift 
operations the intervals are numbered, in monotonic 
increasing order, and their total (N') obtained. 
Using the new multisection factor m=INT(p/N') the
intention is to move the data of new interval i
(centre point and, because of boundaries, its
result) to the m processors j=(i-1)m+1 to j=im.
This is done by either sequentially broadcasting
the interval data and masking out all but m
processors on each broadcast, or by shifting the
data of all the intervals in parallel. The latter
involves first shifting all data leftwards until no
interval needs to be shifted further and then
repeating the process for right shifts. The
efficiency of the parallel shift can be grasped by
noting that if the number of intervals is constant
between iterations, then at most m-1 right or left
shifts are required. In the implementation a
dynamic choice between the two options is made.


Each slave processor independently evaluates 
the sample function for its sample point. This 
Sturm sequence evaluation is identical for all 
points, except for the possible incrementation of 
the eigenvalue count: thus the test for a negative 
sign sets a mask that is then used to mask out 
processors for the incrementation operation. 
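
The iteration of the combined method described above can be sketched as a sequential simulation. This is only an illustration under stated assumptions: the function names and the tuple layout `(lo, hi, count_lo, count_hi)` are hypothetical, sample points are equally spaced, a multisection factor of at least 1 is forced when there are more intervals than processors, and the broadcast, shift and masking operations of the array processor are replaced by ordinary loops.

```python
def sturm_count(diag, offdiag_sq, x, tiny=1e-30):
    """Number of eigenvalues below x (ratio recurrence, assumed zero guard)."""
    count, p = 0, 1.0
    for i, a in enumerate(diag):
        b = offdiag_sq[i - 1] if i > 0 else 0.0
        p = (a - x) - b / p
        if p < 0.0:
            count += 1
        elif p == 0.0:
            p = tiny
    return count


def multisection_step(diag, offdiag_sq, intervals, p):
    """One iteration: sample every interval at m interior points and keep the
    sub-intervals whose boundary counts differ (those containing eigenvalues)."""
    m = max(1, p // len(intervals))            # multisection factor m = INT(p / N')
    new_intervals = []
    for lo, hi, n_lo, n_hi in intervals:
        h = (hi - lo) / (m + 1)
        points = [lo + h * t for t in range(1, m + 1)]
        # On the array processor each of these evaluations runs on its own slave.
        counts = [sturm_count(diag, offdiag_sq, x) for x in points]
        edges = [lo] + points + [hi]
        edge_counts = [n_lo] + counts + [n_hi]
        for t in range(m + 1):                 # nearest neighbour comparison
            if edge_counts[t + 1] > edge_counts[t]:
                new_intervals.append(
                    (edges[t], edges[t + 1], edge_counts[t], edge_counts[t + 1]))
    return new_intervals


def locate_eigenvalues(diag, offdiag_sq, lo, hi, p, tol=1e-10):
    """Return (midpoint, number of eigenvalues) for each final interval."""
    intervals = [(lo, hi, sturm_count(diag, offdiag_sq, lo),
                  sturm_count(diag, offdiag_sq, hi))]
    while intervals and max(b - a for a, b, _, _ in intervals) > tol:
        intervals = multisection_step(diag, offdiag_sq, intervals, p)
    return [(0.5 * (a + b), n_b - n_a) for a, b, n_a, n_b in intervals]


if __name__ == "__main__":
    for mid, mult in locate_eigenvalues([2.0] * 5, [1.0] * 4, -0.5, 4.5, p=8):
        print(round(mid, 8), mult)
```

With one interval and 8 processors the first iteration uses m=8; once five intervals exist, m falls to 1 and the method reduces to parallel bisection over intervals.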


2.1 Results 


It can be noted that each individual
processing element of the ICL DAP has an
arithmetic power about 100 times less than, and a
logical power about equal to, that of its host.


Table 1 indicates the power of the combined
method in locating large numbers of eigenvalues:
the theoretical results indicate a maximum
expected speedup of $\sim N\log_2(1+(4096/N))$,
offset by the limited power of the processing
elements. Table 2 shows that for a matrix of size
n=1024 the parallel recursive method outperforms
the combined method only when searching for up to
two eigenvalues. For smaller (larger but n<4096)
matrices the recursion method performs relatively
worse (better). For locating one eigenvalue of a
matrix of this size, theory predicts speedups of
$\log_2(4096)=12$ and $1024/(4\log_2 1024)=25$
for the combined and recursion methods
respectively: in practice the parallel
versions were 7 and 3 times slower than the host
computer due to the low power of the processing
elements.


The cost of data communication for the
recursion relation was 15% for the example above,
while for the combined multisection it was 17%
for N=64, decreasing steadily to 1.5% for N=4096.


3. Implementation on a Vector Processor 
Vector processors achieve their power by 
introducing pipelining into the various computat- 
ional processes. For our purpose it is sufficient 
to note that the time taken to compute operations 

on vectors of length n is 


T(n) = A + B(n-1), where p=A/B and p>>1. 


In practice this simple formula may only apply up
to a limiting value of n corresponding to some
vector register size of the computer concerned.
Scalars are assumed to take the same time to
process as a unit length vector: that is T(1)=A.
This is a simplification, since on vector
processors that require complex data alignment
networks to be set in order to process vectors the
scalar operation time will be significantly less
than A. Implementations were on the CRAY 1S, for
which p ~ 30-50 (Peterson, 1979).


Consider now the implementation of solution
three: recursive doubling. The ordinary
sequential recursion relation of 3n operations
requires

$$T_s(n) = 3nA \qquad (3.1)$$

since no vector processing is possible. The
parallel recursive doubling solution requires
$\log_2(n)$ stages of 12 operations with between
n/2 and n processes in each stage. In fact
approximately $9n\log_2(n)$ operations are
required. The time to execute these instructions
on a vector processor can be bounded below by
making the assumption that all these operations
can be put into a single vector operation: this
would result in a time of

$$T_r(n) = 9n\log_2(n)B \qquad (3.2)$$

Comparison of these equations shows that recursive
doubling cannot yield results faster than the
sequential solution if $p<3\log_2(n)$. For the
CRAY 1 this implies that parallel recursion cannot
be faster than sequential recursion if $n>2^{10}$,
assuming p=30.
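
The comparison of (3.1) with (3.2) is simple arithmetic and can be checked directly. In this sketch the function names are hypothetical, and times are expressed in units of B, so that A equals the ratio p.

```python
import math


def t_sequential(n, a):
    """Equation (3.1): 3n scalar operations, each taking time A."""
    return 3 * n * a


def t_recursive_lower_bound(n, b):
    """Equation (3.2): lower bound for recursive doubling on a vector processor."""
    return 9 * n * math.log2(n) * b


def breakeven_n(p):
    """Matrix size at which (3.1) equals (3.2), i.e. log2(n) = p / 3."""
    return 2.0 ** (p / 3.0)


if __name__ == "__main__":
    p = 30.0                                   # A / B assumed for the CRAY 1
    print(breakeven_n(p))
    for n in (512, 1024, 2048):
        print(n, t_sequential(n, p), t_recursive_lower_bound(n, 1.0))
```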


This is an interesting limit on all parallel
algorithms that increase the combined computational
complexity by a factor $\log_2(n)$ in order to
generate parallelism of order n.


Let us now consider a combined solution one
and two. Imagine that we are searching for only
one eigenvalue. Then it requires k iterations of
bisection to reduce the interval size by a factor
of $2^k$. Since only one interval is available the
vectors are of length 1 and the time taken is

$$T_b = k \, 2n \, A \qquad (3.3)$$

If multisection is now introduced so that
$m=2^j-1$ samples are taken in the single interval,
then k/j iterations are required and the time
taken is

$$T_m = (k/j) \, 2n \, (A+mB) \qquad (3.4)$$

since the vectors are of length m. Before we
proceed to minimise this equation with respect to
j we can introduce the effect of multiple
subintervals (N) into this last formula on the
assumption that sampling produces no more
subintervals. Thus 3.4 becomes


$$T_m = (k/j) \, 2n \, (A+N(2^j-1)B) \qquad (3.5)$$


Minimising this with respect to j one obtains


$$A/B = N(2^j(j \ln 2 - 1)+1) \qquad (3.6)$$
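
Equation (3.6) gives A/B as a function of j, so finding the best j for a given machine means inverting it. The sketch below does this numerically; the function names, the bisection solver and the upper bound `j_max` are assumptions introduced for illustration, and j is treated as a real number even though in practice an integer multisection factor would be chosen.

```python
import math


def t_multisection(j, k, n, a, b, n_intervals):
    """Equation (3.5): time when each of N intervals is sampled at 2^j - 1 points."""
    return (k / j) * 2 * n * (a + n_intervals * (2.0 ** j - 1.0) * b)


def ratio_at_optimum(j, n_intervals):
    """Right-hand side of (3.6): the ratio A/B for which j minimises (3.5)."""
    return n_intervals * (2.0 ** j * (j * math.log(2.0) - 1.0) + 1.0)


def optimal_j(a_over_b, n_intervals, j_max=64.0):
    """Solve (3.6) for real j > 0 by bisection.

    The right-hand side of (3.6) is 0 at j = 0 and increases with j,
    so the root is unique for any positive A/B.
    """
    lo, hi = 0.0, j_max
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if ratio_at_optimum(mid, n_intervals) < a_over_b:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for n_intervals in (1, 4, 16):
        j = optimal_j(30.0, n_intervals)       # A / B = 30 assumed
        print(n_intervals, round(j, 3), round(2.0 ** j - 1.0, 1))
```

As (3.6) suggests, the optimal number of samples per interval falls as the number of intervals N grows, since the intervals themselves already fill the vector.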


3.1 Results 


Table 1 shows the CRAY 1 time to locate all 
the eigenvalues of some large matrix. The speedup 
comparison is with respect to the optimal 
sequential version (Barth, 1967) which since it 
evaluates only 1 sequential recursion relation at 
a time uses only the scalar functional units of the 
CRAY. 


Table 3 illustrates the improvement that can
result from using multisection when searching for
small numbers of eigenvalues. The minimum times
occur for a value of the multisection factor that
is in rough agreement with equation 3.6. The times
of the sequential algorithm are given under the
multisection factor of zero. For the case of
locating 1 eigenvalue using bisection
(multisection factor=1) the parallel algorithm has
no parallelism and carries out exactly the same
operation as the sequential version. However the
parallel version uses the vector functional units
and it can therefore be seen that scalars can be
processed about 2½ times as fast as vectors of
length 1.


Table 2 compares parallel recursion with 
parallel multisection: the latter using multi- 
section factors determined dynamically in the 
program. Discrepancies between the results of 
Tables 2 and 3 arise from using different 
terminating accuracies. 


4. Implementation on an Asynchronous Multiprocessor


Asynchronous multiprocessor computer systems 
are composed of processors capable of independent 
operation. On these systems similar operations 
may take different amounts of time to execute. 
Thus termination of one or more operations cannot 
be guaranteed by a hardware clock as in array or 
vector processors. Signalling the termination of 
parallel processes is thus an overhead on such 
systems. Furthermore it involves communicating 
information between the processors which requires 
that the processors must share physical and/or 
logical resources. Limited access constraints to 
shared resources imply the speedup is bounded by 
saturation of shared resources availability (see 
for example Barlow, 1982 or Barlow et al, 1982). 
Finally, synchronisation requires that on sub- 
processes of equal complexity faster processors 
must wait on slower ones to finish. 


The system to be considered consists of four
Texas Instruments 990/10 minicomputers linked 
through shared memory (Barlow, et al, 1981). The 
cost of synchronising the termination of paths 
is equivalent to 40 integer operations at a 
minimum. The synchronisation resource itself can 
only be accessed by one processor at a time and 
since this resource has a cycle time of approxi- 
mately 20 integer operations the speedup obtain- 
able is limited by S=(equivalent integer operations 
between access)/20. In addition there is an 
overhead associated with access to the data 
communication system (the shared memory) of 100% 
compared to accesses to local data. The shared 
memory being a shared resource also limits the 
maximum obtainable speedup. A deficiency of the 
current system is that it has no floating point 
hardware and thus these operations take ~40 
times longer to execute than integer operations. 
To supplement our analysis we include values (in 
brackets) that would result from a floating point 
to integer operation execution time ratio of 5. 


For the solution based on parallelism within 
the Sturm sequence evaluation the cost of 
synchronisation is extremely high as it is 
required once per 12 operations from each 
processor. Thus, the overhead due to synchroni- 


sation is 8% (67%) and the limit to speedup due to 
saturation of the synchronisation resource is 24 
(3): and this ignores the extra operations carried 
out in the parallel form of the solution. For small 
numbers of processors and large matrices it is 
possible to reduce the synchronisation by grouping 
subprocesses within a stage together: so that each 
processor takes n/p subprocesses. However it is
clear that a minimum of $\log_2(n)$ more synchronisations
are required than in solutions one and two.
Furthermore from equations 1.1 and 1.2 it can be 
seen that the expected speedup from this solution 
can be at most 


$S = p/(c \log_2(n))$ where $2 \le c < 4$


and thus for this system this solution has 
nothing to offer. 
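
The overhead and saturation figures quoted for the Sturm sequence solution can be checked by simple arithmetic on the costs stated earlier. The following is a minimal sketch; the function and parameter names are illustrative assumptions, and the only inputs are the numbers given in the text (a synchronisation cost of 40 integer operations, a resource cycle time of about 20, one synchronisation per 12 floating point operations).

```python
def sync_costs(ops_between_syncs, fp_to_int_ratio, sync_cost=40, resource_cycle=20):
    """Return (overhead fraction, saturation limit on speedup).

    All costs are expressed in equivalent integer operations.
    """
    # Useful work between two synchronisations, in integer-operation equivalents.
    work = ops_between_syncs * fp_to_int_ratio
    # Cost of one synchronisation relative to the work it separates.
    overhead = sync_cost / work
    # S = (equivalent integer operations between accesses) / cycle time.
    speedup_limit = work / resource_cycle
    return overhead, speedup_limit


# Ratio 40: no floating point hardware; ratio 5: the bracketed values.
for ratio in (40, 5):
    overhead, limit = sync_costs(12, ratio)
    print(f"fp/int ratio {ratio}: overhead {overhead:.0%}, speedup limit {limit:.0f}")
```

With a ratio of 40 this gives about 8% and 24; with a ratio of 5 it gives about 67% and 3, matching the values quoted above.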


The results of a straightforward parallel 
implementation of a combined solution one and two 
are shown in Table 4. This implementation is 
almost identical with the array processor version 
of Section 2. Thus parallelism over intervals and 
possibly within intervals is exploited. The 
results show that a significant amount of 
processor time is wasted (82%) either by some 
processors completing before others or by an 
imbalance in the number of subprocesses to 
processors (a single multisection factor cannot 
always achieve an equal allocation of work to 
processors). 


Since waiting arising from synchronisation 
can be a significant cost on asynchronous systems 
various authors (Kung 1976, Baudet 1977) have re- 
designed algorithms so that they require no syn- 
chronisation. The algorithms do however sometimes 
require coordination (mutual exclusion) to ensure 
the integrity of certain program data structures 
shared by the processors: this coordination 
imposes overheads on access and limits to speedup 
for exactly the same reasons as the shared syn- 
chronisation resource. 


Following these ideas it is of interest to 
develop a form of solutions one and two that 
eliminates the synchronisation that forces fast 
processors to wait on slower ones. For solution 
one (bisection) this is simple since a processor 
sampling at one point of an interval can generate 
new intervals by carrying forward from the 
previous iteration the Sturm sequence results of 
the interval boundaries. The asynchronous 
algorithm can be formulated as: 


a) A list of intervals giving their centre 
point, size, left and right boundary 
number of eigenvalues. 


b) A process that collects the next interval 
from the list, evaluates the Sturm 
sequence at its centre point and then 
adds new smaller intervals to the list. 


Coordination between the processors is then only 
required to ensure one processor at a time access 
to the list. 
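

The list-plus-process formulation above can be sketched as follows for a symmetric tridiagonal matrix. This is only an illustration under stated assumptions: the names `sturm_count` and `find_eigenvalues` are hypothetical, each interval is stored by its two end points and their Sturm counts rather than by centre and size, and Python's `queue.Queue` stands in for the shared list (its internal lock provides the one-processor-at-a-time access). Python threads illustrate the coordination only; they do not give real speedup here.

```python
import queue
import threading


def sturm_count(diag, off, x):
    """Number of eigenvalues below x for a symmetric tridiagonal matrix."""
    count, q = 0, 1.0
    for i, d in enumerate(diag):
        e2 = off[i - 1] ** 2 if i > 0 else 0.0
        q = d - x - e2 / q
        if q == 0.0:
            q = 1e-30          # avoid division by zero at the next step
        if q < 0.0:
            count += 1
    return count


def find_eigenvalues(diag, off, lo, hi, tol=1e-10, workers=4):
    tasks = queue.Queue()      # shared list of intervals (a, b, count(a), count(b))
    found, found_lock = [], threading.Lock()

    def worker():
        while True:
            a, b, na, nb = tasks.get()          # collect the next interval
            try:
                if b - a < tol:                 # interval small enough: record result
                    with found_lock:
                        found.extend([(a + b) / 2] * (nb - na))
                else:
                    mid = (a + b) / 2
                    nm = sturm_count(diag, off, mid)
                    # Boundary counts are carried forward, so each new
                    # interval needs only this one new evaluation.
                    if nm > na:
                        tasks.put((a, mid, na, nm))
                    if nb > nm:
                        tasks.put((mid, b, nm, nb))
            finally:
                tasks.task_done()

    na, nb = sturm_count(diag, off, lo), sturm_count(diag, off, hi)
    if nb > na:
        tasks.put((lo, hi, na, nb))
    for _ in range(workers):
        threading.Thread(target=worker, daemon=True).start()
    tasks.join()               # returns once every interval has been processed
    return sorted(found)


if __name__ == "__main__":
    print(find_eigenvalues([2.0] * 4, [-1.0] * 3, 0.0, 4.0))
```

No worker ever waits for another to finish an evaluation; the only shared state is the interval list and the result list.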


Multisection (solution two) can be incorporated
into this solution by extending the structure
of the list so that each interval becomes represented
by a tree structure. This tree structure is
headed by the bisection interval information.
This level (0) and lower levels then point to
nodes that represent multisection points: 1/2 at
level 1, 1/4 and 3/4 at level 2 etc., each node
consisting of a pointer to its parent, space for
two pointers to children and finally space for
the result of the Sturm sequence evaluation for
the point it represents.


Pointers to nodes are only built when the
point of that node has been taken to be sampled.
Processors search the list/tree structure on a
level by level basis, starting at level 0, so that
the amount of multisection as opposed to bisection
is always minimised. After completion of the
evaluation the result is returned to the relevant 
node. If the node is at level 1 the tree splits 
to represent two new intervals, with each of the 
old level two nodes pointing to one of these 
intervals. Following a splitting operation the 
processor must try to recursively split the newly 
generated intervals since other processors have 
earlier completed sampling level two nodes from 
the old interval. Further details are given in 
Barlow et al (1981a). 


Results for this asynchronous solution shown 
in Table 4 indicate some improvement over the 
synchronous solution. However the cost of tree 
processing is severe (6.8%) and since only one 
processor at a time can access this list there is 
a limit (p=n/16 see Barlow et al 1982) to the 
potential speedup. 


For both synchronous and asynchronous 
solutions the rate of access to shared data is low 
because after having obtained the interval data 
the 2n operations of the Sturm sequence evaluation 
make no reference to shared data. Thus without 
floating point hardware the losses were too small 
to measure: they can be expected to be ~1% for 
matrix sizes of ~ 256 with floating point hardware. 


Conclusion 


The most striking feature of the results is 
the failure of the solution based on parallel 
recursion to yield any significant improvement 
over the sequential recursive solution. Our 
analysis shows that this failure is to be expected 
for large size systems. 


This result has ramifications for all parallel
solutions that for a system of size n convert the
original sequential algorithm of complexity $c_1 n$
into a parallel algorithm that consists of F(n)
steps with each step containing n subprocesses of
complexity $c_2$. For the resulting solution to run
faster than its sequential counterpart, the
number of processors that are utilised must
satisfy $p > (c_2/c_1)F(n)$. This is a severe limitation
for processes based on pipelining as there is
a limit to the number of stages and thus the
effective number of processors.
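

The processor bound can be obtained by comparing run times. The short derivation below is a sketch under two assumptions not spelled out in the text: the sequential cost is written $c_1 n$ and the cost of one subprocess $c_2$, and the n subprocesses of each step are shared evenly among the p processors with no synchronisation cost.

```latex
T_{\mathrm{seq}} = c_1\, n,
\qquad
T_{\mathrm{par}}(p) = F(n)\,\frac{n}{p}\,c_2,
\qquad
S(p) = \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}(p)}
     = \frac{p}{(c_2/c_1)\,F(n)},
\qquad
S(p) > 1 \iff p > \frac{c_2}{c_1}\,F(n).
```

For a recursive solution with $F(n) = \log_2 n$ steps, the speedup therefore grows only as $p / \log_2 n$, which is why small numbers of processors gain nothing on large systems.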


TABLE 1: ICL DAP and CRAY 1 Timings for Locating all the Eigenvalues
Using Combined Multisection and Bisection


Matrix sizes (SIZE): 64, 256, 1024, 4096


* Speedup calculated with respect to ICL 2980 (the DAP host)
** ICL 2980 version ran out of time (CDC 7600 comparison in brackets)
*** Compared to CRAY sequential solution

TABLE 2: ICL DAP and CRAY 1 Timings for Locating a Small Number of 
Eigenvalues of a Matrix of Size 1024 


No. of ICL DAP CRAY 1 
Eigenvalues Solution Solution Solution Solution 
1+2 3 


0.034 secs. 0.145 secs.
0.0485 0.312 
0.0929 1.128 


TABLE 3: Effect of Varying the Multisection Factor on the CRAY 1
(Matrix Size 1024) 


Factor (m) Times (in seconds) to locate Eigenvalues 


Eigenvalue 4 8 16 32 


* Sequential solution time.


TABLE 4: Results for an Asynchronous Multiprocessor (N=16, n=256)


Method | Speedup with 3 processors* | Synchronisation Overhead** (inc. lockout) | Tree Processing Overhead** (inc. lockout)
Synchronous
Asynchronous


* Compared to sequential bisection on one processor
** For the case of 4 processors


This conclusion has been previously pointed
out by Lambiotti (1975). It has led Sameh and 
Brent (1977) and Chen, Kuck and Sameh (1978) to 
develop alternative parallel recursive solutions 
that have less parallelism but are more effective 
with small numbers of processors. These solutions 
are currently being investigated. 


Another interesting point is that, in spite of
the low potential gain from introducing parallel
multisection within an interval and the increase
in processing time on the vector processor, the
method did yield improvements even when locating
small numbers of eigenvalues. The failure of this method for very
small numbers of eigenvalues on the ICL DAP 
reflects the extremely low processing power of the 
array elements. 


Finally, although the asynchronous solution
involving parallelism both within and between
intervals yielded only a slight improvement over
its synchronous counterpart, the former solution
can yield significant gains when the speeds of the
processors differ significantly, since in the
synchronous version all processors are slowed down
to the speed of the slowest processor.


Abstract -- Solving finite element problems
on SIMD or MIMD systems raises implementation
questions, essentially because accesses to the data
structures commonly used in finite element programs
are not conflict free. These difficulties
may be overcome by redesigning algorithms
and partitioning the mesh into non connected subsets.
After a graph model of the problem is built,
the decomposition is related to a graph coloring
algorithm, yielding the elementary tasks and their
corresponding data which are allowed to run concurrently
in a multiple processor system. The study
is implemented on a general hardware and software
MIMD simulator supporting a high level language
and performance evaluation tools.


Introduction 


The emergence of multiple processor systems 
raises numerous questions in many fields of compu- 
ter science : processor-memory organization, data 
allocation, task scheduling, programming languages. 
Other questions are of great concern when one wants 
to write real programs for those new systems. More 
specifically in numerical analysis, we are conducting
a study of parallelization in Partial Differential
Equations problems using finite element
techniques. The environment consists of a high
performance, MIMD system currently under specifi- 
cation, where algorithms and data organization 
have to be redesigned to achieve efficiency, since 
speed is based on concurrency and independence ra- 
ther than on vectorizing or pipelining techniques 
[1][7]. Data access conflicts occurring during the 
algorithmic step of discretization are solved by 
partitioning the whole grid into non connected sub- 
domains. 


The efficiency of this partitioning method 
over conventional ones is evaluated by utilizing a 
general MIMD simulator currently under development 
in companion teams. The MIMD system can be configured
by choosing appropriate parameters for the number
of CPUs, local memory and secondary memory
banks, the behavior of two communication networks 
linking CPUs and local memories on one hand, local 
memories and secondary memories on the other hand. 
A simulation language allows the high level expres- 
sion of tasks, data organization and allocation, 
and the expression of control among tasks which 
will be interpreted by the simulator's supervisor. 
Almost every part of the hardware and software for 
the MIMD system is user-definable, thus many stra- 
tegies can be rapidly set up and compared. 


Conflict free data accesses in multiprocessor systems


Finite element algorithms often require a step
of linearization. During this step, a usually
large matrix is assembled. Its size equals the
number of unknowns in the mesh and varies with the
problem approximation [2], [3], [4]. Once the grid
geometry is fixed, the assembly step comprises two
closely interacting phases for each element:


- compute an elementary matrix representing the
local nodal contribution of an element in the mesh
to the global matrix (figure 1),

- accumulate the matrix elements into the global
matrix supporting the linear system coefficients
(figure 2).


Let NE be the number of elements in the mesh,
C(p) the elementary contribution matrix for element
p, and A the global matrix. Obviously, C(p) and
C(q), for any p and q in {1,...,NE}, may be computed
concurrently (phase one). Unfortunately, as seen
in figure 2, they may alter the same line
of A during the second phase, when their terms are
accumulated into A.


The conventional step consists in computing
one C(p), updating A and repeating it until the 
last element. An SIMD or MIMD system (MIMDS) could 
perform phase one in parallel, but updatings would 
have to be sequential, or strongly sequenced to 
avoid access conflicts to A. MIMD systems would 
achieve better performance than SIMD ones for phase 
one, since data are usually accessed through other 
arrays. Indirect addressing is known to be tedious 
for SIMDs, while independent processors (MIMDs) 
should accommodate it naturally.


However, an MIMDS would be in trouble for pha- 
se two, since two or more updating tasks running on 
different processors will possibly perform load- 
add-store operations on identical elements of A. 
Thus a control must be defined in order to avoid 
conflicts. It will sequence the updating tasks such 
that at any time there should not be simultaneous 
operations on a given subset of A. 


The partitioning technique 


To solve this problem, we seek a partition
of the elements into subsets $S_i$, $i = 1, \dots, n$, such that :


- $\bigcup_{i=1}^{n} S_i$ = domain of integration (the set of elements in the mesh),


- for any element $E$ in the mesh, there exists
one $i \in [1,n]$ such that $E \in S_i$,


- for any $i \in [1,n]$ and any couple of distinct elements
$E_k, E_l \in S_i$, $E_k \cap E_l = \emptyset$.


FIGURE 1 : Contribution matrices (P1-type triangulation with one unknown per point).


FIGURE 2 : Accumulation of contribution matrices into the final linear system
(P1 triangulation; superscripts denote the triangle number).
For all elements belonging to $S_i$, the set of operations
consisting of computing the contribution matrices
(phase one) and assembling them into A (phase
two) can be performed fully concurrently. By
construction of the partition, no conflict occurs
during phase two, which is now mixed with phase one.
Passing from $S_i$ to $S_{i+1}$ will still be synchronized,
with the capability of anticipating phase one of $S_{i+1}$
during the final stages of $S_i$ processing.


The partition is determined by identifying
each element in the mesh with a node in a graph.
Two nodes are adjacent if they correspond to a couple
of neighbour elements, i.e. $E_k$ and $E_l$ such that
$E_k \cap E_l \neq \emptyset$. The construction of subsets $S_i$ of elements
in the mesh is equivalent to the problem of
coloring the corresponding non planar graph [5], [6].


One way to achieve optimal coloring is to find
the chromatic number $\gamma$ of graph G. In practice,
this would be too time consuming if
one wants to include it in the normal processing
of the global matrix computation. Instead, we used
another algorithm, derived from Powell and Welsh's theorem
[8]. Let us denote by $d_i$ the degree (or valency)
of a point $v_i$ in G, i.e. the number of lines incident
with $v_i$, since G is not oriented. Then the algorithm
produces a number of colors $N \le \max_i(d_i) + 1$.


The coloring algorithm can be stated as follows :

- Step 1 : nodes in the graph are re-ordered according
to their decreasing valency. Let $\{p_i\}$, $i = 1, \dots, n$, be
the newly ordered list of elements. Thus $p_1$ has the
largest valency. Let j=1.


- Step 2 : take element $p_{j_1} = p_j$, and find all elements
$p_{j_k}$ in the ordered list such that $p_{j_k}$ is not
adjacent to any $p_{j_\ell}$, $\ell = 1, \dots, k-1$. This forms
partition $S_j$.


- Step 3 : suppress all elements of $S_j$ from the
list. If the list is empty, halt the process, otherwise
let j be the new first element and iterate
step 2.
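

The three steps above can be sketched as follows. This is an illustrative sketch, not the authors' program: `element_graph` and `color_elements` are hypothetical names, elements are given as tuples of node numbers, and two elements are taken to be adjacent when they share at least one node (the condition $E_k \cap E_l \neq \emptyset$).

```python
from itertools import combinations


def element_graph(elements):
    """Adjacency sets: two mesh elements are neighbours if they share a node."""
    adjacency = {e: set() for e in range(len(elements))}
    for a, b in combinations(range(len(elements)), 2):
        if set(elements[a]) & set(elements[b]):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def color_elements(adjacency):
    """Partition the graph nodes into subsets of mutually non-adjacent nodes."""
    # Step 1: order the nodes by decreasing valency.
    remaining = sorted(adjacency, key=lambda v: len(adjacency[v]), reverse=True)
    subsets = []
    while remaining:
        # Step 2: start from the first node and sweep the list, keeping
        # every node that is not adjacent to one already kept.
        chosen = []
        for v in remaining:
            if all(u not in adjacency[v] for u in chosen):
                chosen.append(v)
        subsets.append(chosen)
        # Step 3: remove the coloured nodes and iterate.
        taken = set(chosen)
        remaining = [v for v in remaining if v not in taken]
    return subsets


if __name__ == "__main__":
    # A strip of four triangles on nodes 0..5.
    triangles = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5)]
    print(color_elements(element_graph(triangles)))
```

All elements of one returned subset touch disjoint sets of nodes, so their contributions can be accumulated into A concurrently.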


Fig. 3 shows two applications of the algorithm. 


This basic coloring method is applicable to any
mesh type and, so far, has been tested on
three different regular domain triangulations (P1-type).
The results are shown in Table 1 for different
grid sizes.


Additionally, the basic algorithm can be im- 
proved in two ways : firstly, the difference bet- 
ween the optimal chromatic number and the number 
of colors can be attenuated, and secondly, the num- 
ber of elements for each color can be balanced. 


We chose the latter, thus from a partition,
C1, set up by the basic algorithm, we now build
an improved one, C2, by performing the following
changes :


Step 1. The subsets of C1 are ordered according
to their decreasing number of elements. Let
$S_i$, $i = 1, \dots, n$ be this list.


Step 2. Some elements in $S_1$ are shifted into
$S_n$ when possible. The shifting process halts for $S_1$
and $S_n$ when :

. $S_1$ now has NE/n elements (balanced number).
Then step 2 is done with $S_2$ and $S_n$.

. $S_n$ now has NE/n elements. Step 2 is done with
$S_1$ and $S_{n-1}$.

. No element in $S_1$ can be transferred into $S_n$,
and $S_n$ has fewer than NE/n elements. Step 2 is done
with $S_2$ and $S_n$.
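

The balancing pass can be sketched in the same style. The function name `balance_colors` is hypothetical, and one detail is an assumption of this sketch rather than something the text specifies: when NE/n is not an integer, a donor subset is considered balanced at the ceiling of NE/n and a receiver at the floor.

```python
def balance_colors(subsets, adjacency):
    """Even out subset sizes while keeping each subset free of adjacent pairs."""
    # Step 1: order the subsets by decreasing number of elements.
    subsets = sorted((list(s) for s in subsets), key=len, reverse=True)
    total = sum(len(s) for s in subsets)            # NE
    n = len(subsets)
    low, high = total // n, -(-total // n)          # floor and ceiling of NE/n
    i, j = 0, n - 1                                 # donor index, receiver index
    # Step 2: shift elements from the largest subsets into the smallest.
    while i < j:
        donor, receiver = subsets[i], subsets[j]
        if len(donor) <= high:                      # donor balanced: next donor
            i += 1
        elif len(receiver) >= low:                  # receiver balanced: next receiver
            j -= 1
        else:
            # Elements adjacent to anything in the receiver cannot be moved there.
            blocked = set().union(*(adjacency[e] for e in receiver))
            movable = next((e for e in donor if e not in blocked), None)
            if movable is None:                     # nothing can move: next donor
                i += 1
            else:
                donor.remove(movable)
                receiver.append(movable)
    return subsets


if __name__ == "__main__":
    adjacency = {0: {4}, 1: {5}, 2: set(), 3: set(), 4: {0}, 5: {1}}
    print(balance_colors([[0, 1, 2, 3], [4, 5]], adjacency))
```

The number of colors is unchanged; only the spread of subset sizes around NE/n is reduced.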


We thus obtain a pseudo-uniform number of elements
in each color, close to NE/n. As an illustration,
for the third mesh type in Table 1, whose deviation
is rather large, this optimization gives the
following improvements :

- for 106 elements = 0.7 (instead of 5.7)

- for 232 elements = 0.7 (instead of 10.3)

- for 496 elements = 4.4 (instead of 19.6)

- for 756 elements = 8.4 (instead of 30.96).


Simulation and results 


Our work, oriented to parallel numerical methods,
is part of a joint study of multiprocessor
systems at ONERA-CERT. Other people in our group
developed a simulator of general MIMD architectures.
Before giving a comparison of performance between
several assembling methods, it may be helpful
to define the overall characteristics of the simulator.


The class of MIMD systems simulated corresponds
to figure 4 :


FIGURE 4 : MIMD architectures (Secondary Memory modules, interconnexion network, Migration Processors, Local Memory modules, Elementary Processors, Supervisor and sequencing)


Data and code are initially located in Secon- 
dary Memory modules. A Migration Processor can 
access an SM module via an Omega-type asynchronous 
network, to build up data or code blocks and trans- 
fer them between SMs and LMs (Local Memory modules). 
Each Local Memory is attached to one MP and one 
Elementary Processor (EP). 


A program is made of a collection of tasks 
specified by their actual input/output data on one 
side, and their algorithmic part on the other side. 
Migration Processors execute the in/out part of a 
task, while Elementary Processors execute the algorithmic
part and can be considered as conventional
pipelined machines.


Task sequencing is expressed in a parallel 
language describing concurrency and synchronization 
by a set of independent, non sequential formulas. 
The supervisor interprets formulas, decides which 
ones are firable and executes actions corresponding 
to activable tasks. It sends control signals to MPs 
for data management and to EPs for code execution. 
Although the simulator is fully parametrized, the
results given below are obtained from the following
specifications :

- SMs : Random access memories, cycle time 300 ns,

- Omega Network : Routing time 100 ns, percentage of conflicts 30/100,

- MPs : Machine cycle time 300 ns,

- LMs : 128 K 64 bit word RAMs, cycle time 100 ns,

- k EPs : cycle time 200 ns (5 Mflops); k will range from 1 to 16, giving a peak rate of
80 Mflops.

We now consider an application taken
from aerodynamics or structural analysis problems.
A two-dimensional mesh is composed of 25600 triangular
elements (and 19200 points). A single unknown
at each point leads to four non zero coefficients
per line in the upper semi band of the final matrix
(assumed symmetric). This matrix is implemented
as a sparse structure, where only non zero
terms are stored, with privileged access to lines.
The elementary contribution matrices have six relevant
coefficients. The geometry, i.e. the coordinates
of nodes, is duplicated and included in the
connectivity matrix, which relates the two numberings,
of elements and of nodes, in the mesh.
The execution time for one contribution
is estimated at 25 microseconds. We can
now compare several assembling methods.


First method : use sequential algorithm


One processor is used. Its LM is loaded with 
800 lines of matrix A and local data for 200 con- 
tribution matrices. The EP iteratively computes one 
contribution and assembles it into A. The 800 lines 
of A are re-written into SM and another bulk of da- 
ta is loaded again. There are 128 such iterations, 
leading to the results in Table 2. 


TABLE 2 : Simulation results (Sequential algorithm)

Rows: # EPs, Exec. time (in seconds), Percentage MP usage, Percentage EP usage, M Flops


Second method : Parallelize computation of contributions


The set of all contribution matrices is now a 
data structure by itself, and lies in SM. We can 
split the first step of computation into parallel 
tasks performing a subset of elementary matrices. 
The second step of assembling them into A is still 
sequential. Table 3 summarizes simulation results: 


TABLE 3 : Simulation Results (parallelize contributions)

Rows: Exec. time, Percentage MP usage, Percentage EP usage


This method exhibits a small amount of parallelism
in the first part of the program; however, performance
is not considerably improved when the
number of processors is increased.


Third method : Adding buffers for concurrency


We keep the usual numbering of elements, and 
still distinguish the computation of contributions 
and their assembly into A. The first step can be 
considered as a producer process, while the second 
will consume contribution matrices. The mesh is now
divided into $2\ell$ subsets $\{L_i\}$, $i = 1, \dots, 2\ell$. We can
exploit the following concurrency : while elementary
matrices in subset $L_{2j}$, $j = 1, \dots, \ell$ (respectively
$L_{2j-1}$) are computed and stored into a buffer
BUF1 (respectively BUF2), those already computed
in buffer BUF2 (respectively BUF1) can be assembled
into A, corresponding to subset $L_{2j-1}$ (respectively
$L_{2j-2}$). Thus two levels of parallelism appear
here :

- between computation and assembling,

- within computations.


This version of the algorithm is a representa- 
tive trade-off of parallelization without dramatic 
algorithmic changes from the initial method. Simu- 
lation results are given below in Table 4. 


As can be seen, introducing buffers does not
improve performance significantly. This is
due to several factors :


- the management of buffers, which is explicitly
expressed by the programmer, is time consuming and
increases the overhead in the supervisor,


- the consuming process, assembling the elementary
matrices, is slow and in fact dominates the
total execution time. Adding more buffers is therefore
pointless.


TABLE 4 : Adding buffers for concurrency

Rows: Exec time (seconds), Percentage MP usage, Percentage EP usage, M Flops


Fourth method : Use of coloring algorithm


The mesh is now divided into eight colors with 
3200 elements in each subset. The coloring phase 
is a pre-processing step which prepares the compu- 
tations of elementary matrices and their assembling. 
Note that this step is not part of the normal pro- 
cessing, hence it is not included into the simula- 
tion times. The overhead induced by the preproces- 
sing is expected to be small, and may be minimized 
by parallelizing the coloring algorithm itself. 


The producer/consumer mechanism is maintained, 
however we now exploit an additional level of paral- 
lelism, since the assembling phase is made fully 
parallel for all elementary matrices belonging to 
the same partition. Table 5 gives the simulation 
results. 


gner to describe his geometry with a connectivity
matrix,

- the initial step in those methods consists in
computing some data local to elements, and projecting
them onto a global data structure representing
the state of the physical system,

- when this assembling step leads to a matrix-type
data structure, the next thing to do is a matrix
inversion. Problems arise for this job, since
that matrix is very large and very sparse, with non
zero terms more or less concentrated around the main
diagonal, giving a sparse or profile implementation.


The simulation of the first phase lets us hope
that high performance could be achieved on a multiprocessor
system. As for the second one, if one
uses the current widely used methods for direct inversion
(like Gauss, Choleski), simulations have
not yet reached sufficiently interesting rates to
show that these methods are acceptable with minor
changes. Thus new directions have to be discovered.
The development of our simulator will allow us to
interpret more intricate synchronization graphs of
tasks, corresponding to new methods well adapted
to MIMD machines.


TABLE 5 : Use of coloring algorithm

Rows: Exec time (s); M Flops = 3.8, 16.2, 30.2, 56.7


Finally, the following diagram shows the per- 
formance of the different methods. Needless to say, 
the coloring one exhibits a notable gain over the 
others. 




Abstract -- The block conjugate gradient method
for linear equations is implemented to run on
an MIMD parallel computer. The speedup of the 
parallel version of the method is approximately 
equal to the number of processors used, thus the 
method is well suited to run on a multiprocessor 
computer. Experiments have been performed on the 
Heterogeneous Element Processor manufactured by 
Denelcor, Inc. to validate the analysis and the 
code. 


1. Introduction 


"Plato taught that we do not learn new things; 
we merely remember things we have forgotten. For 
parallel processing, Plato's point is well taken" 
[7]. Indeed, the ideas of parallel computation 
have been around for a long time, but only recent- 
ly have we begun to design efficient parallel al- 
gorithms, write executable codes and experiment 
with real parallel machines. 


The subject of this paper is a block conju- 
gate gradient (BCG) method for solving linear 
equations on an MIMD (multiple-instruction multi- 
ple-data) parallel computer. Originally, the 
method was developed for sequential computing by 
Jennings and Malik [2], who tested it and found
that it was more efficient than the standard con- 
jugate gradient algorithm. The reader interested 
in the method's numerical performance and its 
comparison with other iterative methods should 
consult Jennings and Malik. The block conjugate 
gradient algorithm attracted our attention be- 
cause of its structure which naturally lends it- 
self to parallel implementation on MIMD process- 
ors. To test numerically the parallel block con- 
jugate gradient (PBCG) method we have used the 
HEP (Heterogeneous Element Processor) computer 
manufactured by Denelcor, Inc. [5]. This computer 
represents a departure from traditional computer 
architecture in that it supports multiple instruc- 
tion streams (subroutines) executing cooperatively 
and in parallel to solve a single problem. An 
important feature of the computer is that these 
concurrent streams of instructions need not be 
identical. Moreover, there are means to syn- 
chronize the solution process, i.e., enforce tem- 
poral precedence constraints which are imposed by 
the nature of implemented algorithms. An extended 
Fortran language allows us to create concurrent 
subroutines and synchronize execution of the method.
From the user viewpoint the HEP processor we
used can be regarded as a collection of 1ls<p<9 
independent processors connected to a common main 
memory. 


Let us now consider a set of linear equations, 
Ax = b (1) 


where A is an nxn positive definite matrix, and 
x and b are vectors of the variables and right- 
hand sides, respectively. The standard conjugate 
gradient algorithm for solving (1) is as follows: 


Initial step.
Set k = 0, $r^0 = b - A x^0$, $p^0 = r^0$,


where $x^0$ is an initial approximation to the solution,
usually taken to be null.


Iterative steps.


(i) $u^k = A p^k$
(ii) $\alpha^k = \dfrac{(r^k)^T r^k}{(p^k)^T u^k}$
(iii) $x^{k+1} = x^k + \alpha^k p^k$
(iv) $r^{k+1} = r^k - \alpha^k u^k$
(v) Test convergence, and stop or continue.
(vi) $\beta^k = \dfrac{(r^{k+1})^T r^{k+1}}{(r^k)^T r^k}$
(vii) $p^{k+1} = r^{k+1} + \beta^k p^k$
(viii) Set k:=k+1 and return to (i).

A suitable convergence criterion is $\|r^k\| / \|b\| \le \varepsilon$,
where $\varepsilon$ is a small number.
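

As a concrete reference point, steps (i) to (viii) can be written out directly. The sketch below uses plain Python lists for a small dense system; the function name `conjugate_gradient`, the helper names and the iteration cap `max_iter` are illustrative assumptions, not part of the method as stated.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, eps=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite matrix A."""
    x = [0.0] * len(b)          # x^0 taken to be null
    r = list(b)                 # r^0 = b - A x^0 = b
    p = list(r)                 # p^0 = r^0
    rr = dot(r, r)
    b_norm = dot(b, b) ** 0.5
    if b_norm == 0.0:
        return x
    for _ in range(max_iter):
        u = matvec(A, p)                                  # (i)
        alpha = rr / dot(p, u)                            # (ii)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]     # (iii)
        r = [ri - alpha * ui for ri, ui in zip(r, u)]     # (iv)
        rr_new = dot(r, r)
        if rr_new ** 0.5 / b_norm <= eps:                 # (v)
            break
        beta = rr_new / rr                                # (vi)
        p = [ri + beta * pi for ri, pi in zip(r, p)]      # (vii)
        rr = rr_new                                       # (viii)
    return x


if __name__ == "__main__":
    print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

The only operation that touches the whole matrix is the product in step (i), which is the step the block version reorganises.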

To develop a block conjugate gradient version of
this algorithm we assume that the variables x are
divided into subvectors $x_i$, $x^T = (x_1^T, x_2^T, \dots, x_m^T)$,
and equation (1) is subdivided accordingly,

$$
\begin{bmatrix}
A_{11} & A_{12} & \cdots & A_{1m} \\
A_{21} & A_{22} & \cdots & A_{2m} \\
\vdots & \vdots &        & \vdots \\
A_{m1} & A_{m2} & \cdots & A_{mm}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}.
\qquad (2)
$$

The diagonal submatrices are positive definite and
their Choleski factors are denoted by $L_{ii}$.


Let L denote a lower triangular block diagonal
matrix such that

$$
L = \begin{bmatrix}
L_{11} &        &        & 0 \\
       & L_{22} &        &   \\
       &        & \ddots &   \\
0      &        &        & L_{mm}
\end{bmatrix}.
$$

Introducing a new set of variables

$$ z = L^T x \qquad (3) $$

we obtain from (1)

$$ A L^{-T} z = b $$

and multiplying both sides of this equation by $L^{-1}$
we get

$$ L^{-1} A L^{-T} z = L^{-1} b $$

or

$$ B z = d \qquad (4) $$

where

$$ B = L^{-1} A L^{-T} \qquad (5) $$

and

$$ d = L^{-1} b. $$


The block conjugate gradient method is the standard
conjugate gradient method applied to equation (4).


Since

$$
B = \begin{bmatrix}
I      & C_{12} & \cdots & C_{1m} \\
C_{21} & I      & \cdots & C_{2m} \\
\vdots & \vdots &        & \vdots \\
C_{m1} & C_{m2} & \cdots & I
\end{bmatrix}
$$

where

$$ C_{ij} = L_{ii}^{-1} A_{ij} L_{jj}^{-T}, $$

the expression $u^k = B p^k$ yields subvector expressions
of the form

$$ u_i^k = p_i^k + L_{ii}^{-1} \Big( \sum_{j \ne i} A_{ij} L_{jj}^{-T} p_j^k \Big). $$


The vectors $u_i$ can be calculated in three steps:

(a) $L_{ii}^T w_i^k = p_i^k$
(b) $L_{ii} v_i^k = \sum_{j \ne i} A_{ij} w_j^k$
(c) $u_i^k = p_i^k + v_i^k$


These three steps replace the calculation $u^k = A p^k$
in the standard conjugate gradient method. Also,
step (iii) is replaced by

$$ z^{k+1} = z^k + \alpha^k p^k. $$


The original variables x are calculated using
transformation equations (3).


Now we can state the block conjugate gradient
method in terms of the vectors z, p, r, d, w, v
and u which are subdivided in the same way as x.


The block conjugate gradient method. 


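Initial step.

1. Calculate $L_{ii}$ such that $A_{ii} = L_{ii} L_{ii}^T$, $i = 1, 2, \dots, m$,
and solve for $d_i$:
2. $L_{ii} d_i = b_i$, $i = 1, 2, \dots, m$.

Set $k = 0$, $z^0 = 0$, $p^0 = r^0 = d$.


Iterative steps.


Perform the following sequence of calculations:

3. $L_{ii}^T w_i^k = p_i^k$, $i = 1, 2, \dots, m$
4. $L_{ii} v_i^k = \sum_{j \ne i} A_{ij} w_j^k$, $i = 1, 2, \dots, m$
5. $u_i^k = p_i^k + v_i^k$, $i = 1, 2, \dots, m$
6. $\alpha^k = \dfrac{(r^k)^T r^k}{(p^k)^T u^k}$
7. $z_i^{k+1} = z_i^k + \alpha^k p_i^k$, $i = 1, 2, \dots, m$
8. $r_i^{k+1} = r_i^k - \alpha^k u_i^k$, $i = 1, 2, \dots, m$
9. $\beta^k = \dfrac{(r^{k+1})^T r^{k+1}}{(r^k)^T r^k}$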


10. If convergence is achieved then go to 12,
otherwise continue. Our convergence test is
$\|r^{k+1}\| / \|d\| \le \varepsilon$, where $\varepsilon$ is a suitable accuracy.

11. $p_i^{k+1} = r_i^{k+1} + \beta^k p_i^k$, $i = 1, 2, \dots, m$.
Set k:=k+1 and go to 3.


Finishing step.
Solve for $x_i$:

12. $L_{ii}^T x_i = z_i$, $i = 1, 2, \dots, m$,


and stop.


The BCG method exhibits a remarkable degree of 
parallelism. Note that steps 1, 2, 3, 4, 5, 7, 
8, 11 and 12 decompose naturally and can be com- 
puted concurrently by m processors. Steps 6, 9 
and 10 can be implemented in a parallel fashion 
but they constitute a very small portion of the 
entire computational effort and have been imple- 
mented in our program on one processor. 


2. The Parallel Algorithm. 


For the purpose of our further analysis we make 


the following simplifying assumptions: 


a. n/m is integer and each block in the 
partition of A is of the same size 
n/m. 


b. every multiplication or additive 
arithmetic operation constitutes 
one computational step and all 
steps are equal in time length. 


c. all processors are identical. 
d. the matrix A is fully dense. 


e. the number of partitions m, and 
the number of used processors p 
are equal, p = m.


The first assumption is simplifying but not essential.
If n/m is not an integer then we assume
that some submatrices $A_{ii}$ are of the size $\lfloor n/m \rfloor$
and the remaining diagonal submatrices are of the
size $\lfloor n/m \rfloor + 1$. The second and third assumptions
are correct for the HEP processor. Sparsity is 
not considered in this paper. However, the pre- 
sented PBCG method can be used to solve equations 
with sparse matrices and would run efficiently on 
multiprocessor computers for some classes of struc- 
tured matrices, e.g., banded matrices. 


The last assumption is not restrictive since 
the number of processors p is usually given a 
priori, and we can always use the number of parti- 
tions m = p. 


To design a parallel version of BCG, we break 
the BCG algorithm into a set of computational 
tasks, denoted by T. A task is a collection of 
computational activities (operations) and our con- 
currently running subroutines will consist of se- 
quences of tasks. The tasks of the BCG method are 
shown in Figures 1 and 2 which present this method 
as a pseudocode. Table 1 gives the operation 
count for each task. 


PROGRAM BCG (input:A,b, output:x) 
(1) INITIALIZATION i 
for i = 1 to m do 


CHODEC (input: A.. > output:L,. 


for i = 1 to m do 


FIND z using BCG iteration. 


CALL BCG(L,d,z) Wye 


(3) SOLVE the lower triangular systems 
LI.x, = z, by back substitution 


for i = 1 to m do 


FIGURE 1. 
PROCEDURE BCG (input:L,d,e, output:z) 


(a) for i = 1 to m do 


T kK _ 
solve Ls aw. = 


suml = "eS; 


m 
p> 
i= 
m 
p> 


Sum2 = .2,Pu, 


sum1/sum? 
(d) for i = 1 to m do 


k+1  k+1 
Ys oV5 ) 


m 
j2yeS5 
pk = sum2/suml 
Convergence test. 
continue. 


1.to m do 


sum2 = 


Exit or 


for i 


n3/3m> + 3/2n%/m* + O(n/m) 
n@/mé 

on®/m - n= /me + 4n/m-2 
2m - 1 


6n/m - 1 


TABLE 1. 


Furthermore, we need to identify the time precedence 
constraints relating execution of the tasks. 
With each new task T there are associated two, possibly 
overlapping, ordered sets of memory cells, the 
domain $D_T$ and range $R_T$. When the task T is initiated 
it reads the values stored in its domain and writes 
values into its range cells. We say that two tasks, 
T and T', are noninterfering if either: 

(i) T is a predecessor or successor of T', or 

(ii) $R_T \cap R_{T'} = R_T \cap D_{T'} = D_T \cap R_{T'} = \emptyset$ (empty). 

The pair of set of computational tasks and the 
partial order representing time precedence con- 
straints is called a task system and can be con- 
veniently represented by a directed, acyclic graph, 
without redundant (transitive) arcs. The task sys- 
tem of mutually noninterfering tasks of the PBCG 
method, for m = 3, is shown in Figure 3 as the graph 
G. Note that if we execute various tasks T of G 
in parallel but follow the precedence constraints 
(execution of each task T commences only after all 
immediate predecessor tasks of T are completed) 
then the intermediate and final results of the 
computation will be exactly the same as the results 
of the sequential program. 
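
The noninterference condition can be checked mechanically once the domain and range of each task are written down as sets. The Python sketch below is illustrative only: the `Task` class, the cell names and the three example tasks are assumptions made for this example and are not taken from the PBCG program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    domain: frozenset   # memory cells the task reads
    rng: frozenset      # memory cells the task writes (its range)


def noninterfering(t1, t2, ordered=False):
    """Return True if t1 and t2 are noninterfering.

    ordered=True means one task is a predecessor or successor of the other
    in the precedence order, which is condition (i).
    """
    if ordered:
        return True
    # Condition (ii): no write/write, write/read or read/write overlap.
    return (not (t1.rng & t2.rng)
            and not (t1.rng & t2.domain)
            and not (t1.domain & t2.rng))


# Hypothetical tasks: two block updates and one scalar reduction.
t_a = Task("update_z1", frozenset({"z1", "p1", "alpha"}), frozenset({"z1"}))
t_b = Task("update_z2", frozenset({"z2", "p2", "alpha"}), frozenset({"z2"}))
t_c = Task("compute_alpha", frozenset({"r1", "r2"}), frozenset({"alpha"}))

print(noninterfering(t_a, t_b))                # disjoint blocks
print(noninterfering(t_a, t_c))                # t_a reads what t_c writes
print(noninterfering(t_a, t_c, ordered=True))  # safe once ordered
```

Two unordered tasks that share only read cells, such as the two block updates here, may run concurrently; a task that reads a cell written by another must be ordered after it in the graph.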
FIGURE 3. Task execution precedence graph G for m = 3. 


Since we assumed that the number of processors 
and partitions are the same we can easily schedule 
the task execution. Figure 4 shows a schedule for 
m= 3. The shaded areas indicate idle periods 
for processors. 


FIGURE 4. Task schedule for m = 3 (time on the horizontal axis, one row each for Processors 1, 2 and 3). 


Giving weights to the nodes of the graph G 
according to Table 1 we obtain a weighted graph 
which has m maximum length paths from START to STOP. 


One of them is, for instance, START , Tt ,(ITR-1)-times 


the path i; qed Te? TEST Ta qe 


3 1 3 1 3 | 3 


Tog 50> lest STOP: The weights of the nodes 
START and STOP are zeros. The weight of the TEST 
task depends on the selected convergence criterion 
and is not included in our operation count. 


The sum of weights along this path, which is 
the maximum path length in the weighted graph, is: 


243m? m+ 1TR(2né /am+12n/m)-2n/m+ITR(3n-4) 


=n? / 3m 
The total number of steps required by the BCG 
method is: 


TS ey ae 


ty = m(n3/3m>+ 2 n@/m°+ITR(2n° /me12n/m)-2n/m)-1TR 


where ITR is the number of used iterations. $t_1$ is 
the length of execution time for BCG on a uniprocessor, 
measured in steps. Thus the speedup of the parallel 
algorithm for the chosen schedule with m processors 
is: 
Pye 2 : 

m +ITR(2n° /m+12n/m)-2n/m-ITR 


/3m+3n°/ 
1. es ee 2 : 
m n~/3m tan /m-+ITR(2n°/m+12n/m)-2n/m+ITR(3m-4) 
Assuming that: (i) we solve sufficiently large 
systems, (ii) $n \gg m$, and (iii) $ITR \ll n$, the value 
of $S_m$ is very close to m, which is the maximal 
speedup achievable under ideal conditions. In 
reality, there is some loss of speedup due to the 
overhead in the parallel computing process. We 
have to create parallel subroutines, and synchro- 
nize their progress. Additional time may be needed 
for data transfer and potential memory contention. 
Also we have taken into account the arithmetic work 
but ignored other instructions, such as do-loop 
controls. 


Kumar [3] estimated that the time required to 
create and synchronize parallel subprograms in the 
PBCG method is: 


th = 2m(7 ITR + 2) - 10 ITR. 


Thus the total execution time for the PBCG method 
is at least 

$t_m + t_h$ 

and the corresponding speedup is at most 

$S_m = \frac{t_1}{t_m + t_h}$ (6) 


3. Numerical Results. 
To test the PBCG method the following two types 
of problems have been solved: 


a. randomly generated positive definite 
systems. The matrices A and the vec- 
tors x have been generated randomly and 
then A has been made diagonally domi- 
nant. The right-hand side vector b 


has been calculated from b = Ax. 


b. using the five-point-star finite difference 
formula an elliptic boundary 
value problem has been converted to 
the problem of solving linear equations [1]. 
The problem is: 
$\nabla^2 u - 2u = g(x,y)$ inside a unit square 
R, 0 < x < 1, 0 < y < 1, and u = 0 on the 
boundary of R, 

where 
$g(x,y) = x^2 + y^2 - x - y - xy(xy - x - y + 1)$. 

The problem has the solution 
$u = \frac{1}{2} xy(x-1)(y-1)$. 
The matrix for the second problem is highly sparse 
for large n, but our code does not take advantage 
of sparsity in storing or manipulating the elements 
of A. 
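
The stated solution can be checked against the five-point formula directly. Because u is quadratic in each variable, the central second differences are exact, so the exact solution should satisfy the discrete equations up to rounding error. The short Python sketch below assumes an N x N interior grid with spacing h = 1/(N+1); the function names are chosen for this illustration only.

```python
def u_exact(x, y):
    # Stated solution u = 1/2 * x*y*(x-1)*(y-1); it vanishes on the boundary.
    return 0.5 * x * y * (x - 1.0) * (y - 1.0)


def g(x, y):
    # Right-hand side of the boundary value problem.
    return x * x + y * y - x - y - x * y * (x * y - x - y + 1.0)


def max_residual(N):
    """Largest residual of the five-point equations at the exact solution."""
    h = 1.0 / (N + 1)
    worst = 0.0
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            x, y = i * h, j * h
            # Five-point-star approximation of the Laplacian at (x, y).
            lap = (u_exact(x + h, y) + u_exact(x - h, y)
                   + u_exact(x, y + h) + u_exact(x, y - h)
                   - 4.0 * u_exact(x, y)) / (h * h)
            worst = max(worst, abs(lap - 2.0 * u_exact(x, y) - g(x, y)))
    return worst


print(max_residual(8))  # expected to be at rounding-error level
```

With N = 8 the grid has 64 interior points, which corresponds to the n = 64 case of the boundary value problem.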
A sample of our computational results for several 


values of n and m is shown in Tables 2 and 3. To 
predict speedup we used equation (6). The actual 


execution times $t_1$ (one processor) and $t_m$ (m processors) 
are in seconds of the HEP computer. 


ec 


2 | 0.1157 | 0.0581 1.9961 
Ea 
3°} 6 | 0.2504 | 0.0460 | 5.8952 
Mele seen 0.0466 7.8447 
0.4409 3.9910 
Mie ieee 


Table 2. 


Random matrix. 


. 3 >m 
Predicted! Actual 


2 | 0.0366 | 0.0189 1.9923 
16 | 4 | 0.0542 | 0.0157 3.8877 
z 
36 | 6 | 0.3009 | 0.0562 | 5.0005 
64 re 1.0508 | 0.1387 | 7.9083 | 


Table 3. Boundary value problem. 


4. Conclusions. 

The computational results support our expectation 
that the PBCG method is very efficient. The 
efficiency of a parallel method can be measured by 
the value of 

$E_m = S_m / m$ (7) 

and the efficiency of the PBCG is close to the optimal 
value $E_m = 1$. The idle periods for m - 1 
processors are very short as compared with the total 
execution time. This compares favorably with performance 
of the parallel LU decomposition and Givens 
transformation methods for linear equations [4]. 


In our implementation the PBCG method is bimodal, 
i.e., either all processors are busy or only one. 
Of course, it is possible to use more processors 
for the computation of $\alpha^k$ and $\beta^k$ but the resulting 
overhead could eliminate potential advantages. Bi- 
modal methods have been considered by Ware [6] and 
Worlton [7] who have pointed out that even a small 
amount of sequential processing can significantly 
reduce the effectiveness of a multiprocessor if 
the number of processors p is large. Assume, for 
instance, that p = 100 and only s = 1/100 of the 
entire computational work is done sequentially on 
one processor. The ideal speedup of 100 is reduced 
to 

$S_{100} = \frac{1}{s + (1-s)/p} = 50.25$ (8) 

On the other hand for p = 10 we have only a small 
loss since 

$S_{10} = 9.17$ (9) 

Hence, if the execution units of the multiprocessor 
with p = 10 are ten times faster than the 
execution units of the multiprocessor with p = 100 
we have $t_{10} \approx \frac{1}{2} t_{100}$. This led Worlton to the 
conclusion that there is less risk in the use of 
multiprocessors having a small number of fast pro- 
cessors than there is in the use of multiprocessors 
having a large number of slow processors. Our 
experimentation with the PBCG method is an example 
illustrating the point. 
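
The arithmetic behind this comparison is easy to reproduce. The Python sketch below evaluates the bimodal speedup formula 1/(s + (1-s)/p) for the two machines discussed; the function name `bimodal_speedup` and the normalization of processor speed (the p = 10 machine is taken to have units ten times faster) are assumptions of this illustration.

```python
def bimodal_speedup(s, p):
    """Speedup when a fraction s of the work runs on one processor
    and the remaining 1 - s is spread evenly over p processors."""
    return 1.0 / (s + (1.0 - s) / p)


s = 0.01
s100 = bimodal_speedup(s, 100)   # roughly 50.25
s10 = bimodal_speedup(s, 10)     # roughly 9.17

# Execution time is proportional to 1 / (unit speed * speedup).
# Units of the p = 10 machine are assumed ten times faster.
t100 = 1.0 / (1.0 * s100)
t10 = 1.0 / (10.0 * s10)

print(round(s100, 2), round(s10, 2), round(t10 / t100, 3))
```

The ratio of execution times comes out a little above one half, which is the sense in which ten fast processors outperform one hundred slow ones in this example.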
A MULTI-COLOR SOR METHOD FOR PARALLEL COMPUTATION 

Abstract* 


This paper considers a generalization of 
the classical red/black ordering of grid points 
for finite difference or finite element discret- 
izations of elliptic partial differential equa- 
tions. These "multi-color" orderings are shown 
to be effective in the implementation of the SOR 
iteration method on vector or parallel comput- 
ers. Examples are given of various orderings 
for different discretizations and implementation 
on the CDC Cyber 203/205 and the Finite Element 
Machine is discussed. 


Introduction 


We are concerned in this paper with the 
solution of a sparse n x n linear system of equations 


Ax = b (1.1) 


by iterative methods, especially SOR type 
methods, on parallel arrays or vector computers. 
As opposed to the Jacobi iteration, which has 
rather ideal properties for parallel computa- 
tion, the SOR method is essentially a sequential 
method. However, several authors (e.g. Hayes 
[1974], Lambiotte [1975]) have observed that if 
(1.1) arises from a five-point finite difference 
discretization of Poisson's equation and the 
equations are ordered according to the classical 
Red/Black partitioning of the grid points then 
an SOR sweep may be carried out, in essence, by 
two Jacobi sweeps, one on the equations 
corresponding to the red points and one for the 
equations corresponding to the black points. 
Thus, in this case, the SOR method can be effec- 
tively implemented on vector or parallel comput- 
ers. 

This strategy does not work, however, for 
higher order finite difference or finite element 
discretizations or for more general elliptic 
equations which contain mixed partial derivative 
terms. In these cases, it is necessary to generalize 
the Red/Black partitioning of the grid 
points to a "multi-color" partitioning; for 
example, a three color partitioning, say 
Red/Black/Green, might give the desired result. 
In general, the number of colors necessary will 
depend on the connectivity pattern of the grid 
points. If p colors are used, an SOR sweep can 
be implemented by p Jacobi sweeps, one for each 
set of equations associated with a given color. 


*This research was sponsored by the National 
Aeronautics and Space Administration under grant 
number NAG1-460. The work of the second author 
was partially supported under NASA grant number 
NAG1-16394 while he was in residence at ICASE. 
For vector computers, this reduces the effective 
vector length to O(n/p) while for parallel 
arrays it is necessary that each processor hold 
a multiple of p equations. This multiple will 
be determined by the particular discretization. 
Clearly, there will be a point of diminishing 
returns as p increases but for most differential 
equations and discretizations of interest it 
seems that no more than 6 colors will be needed 
and for the size of n we have in mind ($n \approx 10{,}000$ 
or more), the multi-color strategy can be very 
effective. 

We note that multi-color orderings for SOR 
have been used before (see Young [1971]) but, to 
the best of our knowledge, have not been used in 
the context of parallel computation. 

In the next section, we describe the method 
in more detail and in Section 3 we discuss some 
of the implementation questions for both vector 
computers and parallel arrays. We do not 
address the many other problems in the success- 
ful use of the SOR iteration, especially the 
problem of determining an optimum relaxation 
parameter. 


Multi-Color Orderings 


For concreteness, we consider first an 
elliptic equation of the form 


$u_{xx} + a u_{xy} + u_{yy} = f$ (2.1) 


on the unit square with Dirichlet boundary conditions 
where a is a given constant and f is a 
given function of x and y. We discretize (2.1) 
with the usual second-order finite difference 
approximations (see, e.g., Forsythe and Wasow 
[1960]) which give the difference equations 


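$u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{ij} + \frac{a}{4}\left(u_{i+1,j+1} - u_{i-1,j+1} - u_{i+1,j-1} + u_{i-1,j-1}\right) = h^2 f_{ij}$ (2.2) 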
where h is the spacing between grid points, 
i, j = 1, ..., N where h(N+1) = 1, $u_{ij}$ denotes the approximate 
solution at the (i, j)th grid point, and 
$f_{ij} = f(ih, jh)$. Now partition the grid points by 
the Red/Black scheme, as indicated by Figure 1, 
and then number the grid points in each class 
from left to right, bottom to top. 


R B R B 

B R B R 

R B R B 

B R B R 
Figure 1. Red/Black Ordering 


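If a = 0, so that (2.1) is just Poisson's equation, 
then it is well-known (see, e.g., Young 
[1971]) that the difference equations (2.2) may 
be written in the partitioned matrix form 

$\begin{pmatrix} D & B \\ B^T & D \end{pmatrix} \begin{pmatrix} u_r \\ u_b \end{pmatrix} = \begin{pmatrix} f_r \\ f_b \end{pmatrix}$ (2.3) 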


where D is a diagonal matrix and $u_r$ and $u_b$ 
denote the vectors of unknowns associated with 
the red and black grid points respectively. The 
Gauss-Seidel iteration for (2.3) may be written 
as 

$D u_r^{k+1} = -B u_b^k + f_r$ 
$D u_b^{k+1} = -B^T u_r^{k+1} + f_b$ (2.4) 

and each part of (2.4) can then be effectively 
implemented in a parallel fashion, with the 
introduction of the SOR parameter causing no 
problem. 
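
The two half-sweeps can be sketched in a few lines of Python for the Poisson case a = 0. This is a minimal illustration, not the authors' code: the function name, the colour convention ((i + j) even is "red") and the test problem are assumptions. Each colour is updated Jacobi-style, that is, all new values are computed from the other colour before any is stored, which is what makes the sweep suitable for vector or parallel execution.

```python
def red_black_sor(N, f, omega, sweeps):
    """SOR for u_xx + u_yy = f on the unit square with zero boundary values,
    using the five-point stencil and the Red/Black ordering."""
    h = 1.0 / (N + 1)
    u = [[0.0] * (N + 2) for _ in range(N + 2)]
    for _ in range(sweeps):
        for color in (0, 1):              # 0 = red, 1 = black (assumed)
            new = {}
            for i in range(1, N + 1):
                for j in range(1, N + 1):
                    if (i + j) % 2 == color:
                        # Gauss-Seidel value uses only the other colour.
                        gs = 0.25 * (u[i + 1][j] + u[i - 1][j]
                                     + u[i][j + 1] + u[i][j - 1]
                                     - h * h * f(i * h, j * h))
                        new[(i, j)] = u[i][j] + omega * (gs - u[i][j])
            # Store phase: all points of this colour at once.
            for (i, j), value in new.items():
                u[i][j] = value
    return u


# Assumed test problem: u = x(1-x)y(1-y), so f = -2y(1-y) - 2x(1-x).
def f_test(x, y):
    return -2.0 * y * (1.0 - y) - 2.0 * x * (1.0 - x)


N = 7
u = red_black_sor(N, f_test, omega=1.5, sweeps=100)
h = 1.0 / (N + 1)
max_error = max(abs(u[i][j] - (i * h) * (1 - i * h) * (j * h) * (1 - j * h))
                for i in range(1, N + 1) for j in range(1, N + 1))
print(max_error)
```

The relaxation parameter enters only as a scalar blend of old and new values, so it does not disturb the independence of the updates within a colour.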

If $a \neq 0$, the form (2.3) of the difference 
equations is still valid although D is no longer 
a diagonal matrix and the Gauss-Seidel step 
(2.4) is no longer easily implementable in a 
parallel fashion. The problem is that unknowns 
corresponding to red points are coupled to each 
other in (2.3) (and black points to each other 
also) whereas when a = 0, they completely uncouple. 
Thus we wish to introduce another partitioning 
of the grid points for which unknowns 
within each subset of the partitioning are 
uncoupled. If we consider the grid point stencil 
for (2.2), shown in Figure 2, 

R B R 
B R B 
R B R 
Figure 2. Stencil for (2.2) 
with the Red/Black ordering, we see that the 
center Red point is connected to the Red points 
at the four corners. If, however, we use four 
subsets of grid points, labeled red, black, 
white, orange, we can ensure that each center 
point connects with only points of different 


colors. A suitable coloring pattern for this is 
illustrated in Figure 3. 


R B W O R B W O 
W O R B W O R B 
R B W O R B W O 
W O R B W O R B 


Figure 3. Four color ordering of the gridpoints 


In this case, the system (2.2) can be written in 
a partitioned form analogous to (2.3) as 


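$\begin{pmatrix} D_1 & B_{12} & B_{13} & B_{14} \\ B_{21} & D_2 & B_{23} & B_{24} \\ B_{31} & B_{32} & D_3 & B_{34} \\ B_{41} & B_{42} & B_{43} & D_4 \end{pmatrix} \begin{pmatrix} u_r \\ u_b \\ u_w \\ u_o \end{pmatrix} = \begin{pmatrix} f_r \\ f_b \\ f_w \\ f_o \end{pmatrix}$ (2.5) 

where $D_1, \ldots, D_4$ are diagonal matrices. 
The Gauss-Seidel iteration in terms of (2.5) is 
then 

$D_1 u_r^{k+1} = -B_{12} u_b^k - B_{13} u_w^k - B_{14} u_o^k + f_r$ 
$D_2 u_b^{k+1} = -B_{21} u_r^{k+1} - B_{23} u_w^k - B_{24} u_o^k + f_b$ (2.6) 

with similar equations for $u_w^{k+1}$ and $u_o^{k+1}$. Since 
the $D_i$ are diagonal, (2.6) is easily implementable 
on vector or parallel architectures. 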
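
Whether a periodic colouring decouples a given stencil can be tested by brute force. The Python sketch below checks the Red/Black and four-colour patterns against the five-point and nine-point stencils; the helper name `decouples`, the pattern encoding (one repeating tile of the colouring) and the stencil offset lists are assumptions of this illustration.

```python
# One repeating tile of each colouring (rows repeat down, columns across).
RED_BLACK = ["RB",
             "BR"]
FOUR_COLOR = ["RBWO",
              "WORB"]

# Neighbour offsets (row, column) relative to the centre point.
FIVE_POINT = [(1, 0), (-1, 0), (0, 1), (0, -1)]
NINE_POINT = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
              if (di, dj) != (0, 0)]


def decouples(pattern, stencil):
    """True if no point is connected to a point of its own colour."""
    rows, cols = len(pattern), len(pattern[0])
    for i in range(rows):
        for j in range(cols):
            for di, dj in stencil:
                neighbour = pattern[(i + di) % rows][(j + dj) % cols]
                if neighbour == pattern[i][j]:
                    return False
    return True


print(decouples(RED_BLACK, FIVE_POINT))    # Poisson case, a = 0
print(decouples(RED_BLACK, NINE_POINT))    # corner points share a colour
print(decouples(FOUR_COLOR, NINE_POINT))   # four colours uncouple the stencil
```

A colouring that passes this test gives diagonal blocks on the diagonal of the partitioned matrix, which is the property the iteration relies on.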

A variety of other connectivity patterns 
arise from either finite difference or finite 
element discretizations. Two of the more common 
are illustrated by their stencils in Figure 4, 


(a) (b) 
Figure 4. Common finite element stencils 


in which (a) arises, for example, from finite 
element discretization by piecewise linear functions 
over triangular subregions and (b) by 
piecewise quadratic functions. In case (a), 
three colors are necessary and sufficient to 
achieve the desired decoupling while in case (b) 
six colors are required. The coloring patterns 
for the two cases are illustrated in Figure 5. 


                 G R G R G R 
                 O B P B Y B 
                 G R G R G R 
B G R            P B Y B O B 
G R B            G R G R G R 
R B G            Y B O B P B 
 (a)                 (b) 


Figure 5. Three and six color partitions 


In both cases, the color patterns repeat beyond 
the subregions illustrated. 

A variety of other examples could be given. 
Provided that the domain of the differential 
equation is a rectangle or other regular two or 
three dimensional region and the discretization 
stencil is repeated at each grid, it is usually 
evident how to color the grid points to achieve 


the desired result. However, for arbitrary 
discretizations and/or irregular regions there 
is at present no algorithm to carry out the 
coloring. 


Implementation Considerations 


We discuss briefly in this section some of 
the implementation considerations of the multi- 
color SOR method on vector computers and paral- 
lel arrays. For concreteness, we will use the 
CDC CYBER 203/205 as an example of the former 
and the Finite Element Machine at NASA's Langley 
Research Center as an example of the latter. 

On the CYBER 203/205, vectors consist of 
contiguous storage locations and the efficiency 
of the vector operations is strongly dependent 
on vector length. Maximum efficiency is 
achieved for very long vectors. For vectors of 
length 1000 around 90% efficiency is obtained, 
but this drops to approximately 50% or less for 
vectors of length 100 and less than 10% for 
length 10. Hence, we would like to keep vector 
lengths on the order of 1000 or more whenever 
possible. 

Consider, for example, the difference equations 
(2.2) and suppose that h = .01 so that N = 99 
and $n = N^2 \approx 10^4$. The implementation of Jacobi's 
method on this problem can be done in a 
straightforward way using vectors of length N, 
corresponding to the unknowns in each row of 
grid points. It is desirable, however, to work 
with vectors of length order n and it is possible 
to achieve this by considering the boundary 
values to be unknowns and ordering all the grid 
points, including the boundary points, from left 
to right, bottom to top and then applying the 
Jacobi iteration to the corresponding vector of 
length $(N+2)^2$ of unknowns. The boundary values, 
of course, cannot be changed by the iteration 
and this is prevented by use of the control vector 
feature on the 203/205 which allows suppression 
of storage of updated values into the boundary 
locations. (See, e.g., Lambiotte [1975] or 
Ortega and Voigt [1977] for more details on this 
procedure.) Since the calculation of new values 
corresponding to the boundary points is superfluous, 
this introduces an inefficiency of 
approximately 4% for N = 99 but allows almost full 
efficiency of the vector operations. 
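
The long-vector formulation and the size of the overhead can be illustrated with a short Python sketch. It is not CYBER code: the list `mask` merely plays the role of the control vector, the Poisson case a = 0 is assumed for the stencil, and all function names are chosen for this example.

```python
def control_vector(N):
    """True for interior points of the (N+2) x (N+2) grid stored row by row."""
    M = N + 2
    return [0 < k // M < M - 1 and 0 < k % M < M - 1 for k in range(M * M)]


def boundary_overhead(N):
    """Fraction of the long vector occupied by boundary points."""
    mask = control_vector(N)
    return mask.count(False) / len(mask)


def masked_jacobi_sweep(u, rhs, N):
    """One Jacobi sweep over a single long vector; rhs holds h*h*f.
    A candidate value is computed at every position where all four
    neighbours exist, but stored only where the control vector allows."""
    M = N + 2
    mask = control_vector(N)
    new = list(u)
    for k in range(M, M * M - M):
        candidate = 0.25 * (u[k - 1] + u[k + 1] + u[k - M] + u[k + M] - rhs[k])
        if mask[k]:
            new[k] = candidate
    return new


print(round(100 * boundary_overhead(99), 2))  # percentage of superfluous work
```

For N = 99 the boundary accounts for 400 of the 10201 entries, which is the source of the figure of approximately 4%.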

For the Gauss-Seidel or SOR method for 
(2.2) we use the four-color ordering of Figure 
3 and order the unknowns into four vectors 
corresponding to the grid points associated with 
the four colors. The matrix-theoretic description 
(2.6) of the Gauss-Seidel iteration is then 
implemented by four separate Jacobi sweeps, one 
for each color. As above, the boundary values 
are considered as unknowns and then updated 
values suppressed on storage. Since the vector 
lengths are now on the order of 2500, the 
corresponding vector operations will run at 
about 95% efficiency. The introduction of the 
SOR parameter causes no difficulty. 

We turn now to parallel arrays. The Finite 
Element Machine (FEM) is a prototype array of 36 
microprocessors, arranged in a 6x6 grid. Each 
processor is connected to eight nearest neighbors, 
as illustrated in Figure 6, 

Figure 6. Processor Interconnections on FEM 


and there is also a global bus that connects all 
processors. Further details, which do not con- 
cern us here, may be found in Jordan [1978] and 
the references therein. 

Our primary goal in the implementation of 
the multi-color SOR method on the Finite Element 
Machine, or on a similar array with perhaps many 
more processors but limited processor to processor 
interconnections, is to keep as many processors 
as possible running at a given time. This, 
in turn, requires maximum use of the processor 
interconnections and minimum use of the global 
bus since contention for the bus will tend to 
introduce delays which cause processors to be 
idle. 

Perhaps the primary consideration in the 
implementation is to ensure that each processor 
holds at least as many unknowns as a certain 
multiple of the number of colors where this mul- 
tiple is the number of rows above the center 
point in the gridpoint interconnection stencil. 
Thus, for example, if we consider the gridpoint 
interconnection stencil of Figure 4(a) and the 
corresponding three color ordering of Figure 
5(a), we would assign a minimum of 3 unknowns to 
each processor as illustrated in Figure 7(a). 
Similarly, for the stencil of Figure 4(b) and 
the corresponding six color ordering of Figure 
5(b), we would assign a minimum of 12 unknowns 
to each processor as illustrated in Figure 7(b). 


Figure 7(a). Processor Assignment 

Figure 7(b). Processor Assignment 


In the simplest case of 36 processors and 108 
grid points, with 108 corresponding unknowns, 
the assignment scheme of Figure 7(a) would be 
sufficient and the SOR method would be implemented 
by Jacobi operations, first on all the 
Red points, then the Black, then the Green. To 
carry out these Jacobi operations, current 
values of neighboring unknowns would be obtained 
either from the processor itself or a neighbor 
processor and no use of the global bus is necessary. 
Known boundary values would be stored in 
the processors which needed them. In any problem 
of interest, however, there would almost 
certainly be many more grid points and unknowns 
than processors. For the situation discussed 
above with three colors, we would assign unknowns 
in multiples of three to the processors. 
Similarly, for the grid point stencil of Figure 
4(b) and corresponding six color pattern of Figure 
5(b), we would assign unknowns in multiples 
of 12 to each processor. 

The above assignment strategy would allow 
each processor to run without waiting except for 
two problems, synchronization and convergence. 
Consider a Jacobi operation on all the unknowns 
of a given color. The processors may complete 
their work on this operation in different times 
due to a number of factors: slightly different 
clock times in the processors; different memory 
access times, especially for those processors 
containing unknowns connected to boundary 
values; different numbers of unknowns assigned 
to processors and so on. To compensate for 
these possible differences in processing times, 
the computation can be synchronized by having 
each processor set a flag when it is done with 
its calculation on the current Jacobi operation 
and then wait for all other processors to complete. 
This synchronization, of course, introduces 
delays. Alternately, the processors can 
run asynchronously. In this case, the numerical 
iterations will tend to deviate from the true 
mathematical iteration, although the consequences 
of this may even be beneficial. (See, 
e.g., Baudet [1978] and the references therein 
for further discussion of asynchronous iterative 
methods.) 
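
The synchronized variant amounts to a barrier after each color. The Python sketch below illustrates that control flow with threads standing in for processors; it uses the standard `threading.Barrier` in place of the flag-and-wait scheme described above, and the function names and the `record` callback are assumptions of this example rather than Finite Element Machine code.

```python
import threading


def run_colored_sweep(num_workers, colors, work):
    """Run one SOR sweep: every worker processes its unknowns of one color,
    then waits at a barrier until all workers have finished that color."""
    barrier = threading.Barrier(num_workers)

    def worker(pid):
        for color in colors:
            work(pid, color)     # Jacobi operation on this worker's unknowns
            barrier.wait()       # "set a flag" and wait for all the others

    threads = [threading.Thread(target=worker, args=(pid,))
               for pid in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Record the order in which (color, processor) work items are executed.
log = []
log_lock = threading.Lock()


def record(pid, color):
    with log_lock:
        log.append((color, pid))


run_colored_sweep(4, ["R", "B", "G"], record)
order = [color for color, _ in log]
print(order)
```

Within one color the processors may finish in any order, but no processor starts the next color until all have finished the current one; dropping the barrier gives the asynchronous variant.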

It is, of course, necessary to check for 
convergence of the iterative process. At the 
end of each SOR iteration, each processor can 
monitor the convergence of the unknowns assigned 
to it, probably by comparing the current and 
previous iterates. When the convergence criterion 
has been satisfied for all unknowns 
assigned to a given processor, that processor 
must continue the iteration until the convergence 
criterion is satisfied in all processors. 
Hence, the whole process will not terminate 
until all unknowns have satisfied the convergence 
criterion and towards the end of the process 
a portion of the processors may be doing 
unnecessary work. This seems to be an unavoidable 
inefficiency. 
Summary and Conclusions 
The multi-color SOR method described herein 


seems promising for vector and array processors 


although practical experience to date has been 
limited to a few numerical experiments on a 
four-processor version of the Finite Element 


Machine. It faces the usual difficulty with the 


SOR method of obtaining suitably good values of 
the overrelaxation parameter and for most appli- 
cations of current interest for which a vector 
computer or large array would be used, there is 
little theory to help in this choice. For 
irregular regions, there is also the problem of 
processor assignment and coloring of the grid 
points; the processor assignment problem has 
been addressed by various authors (see, e.g., 
Bokhari [1979] and Gannon [1980]) but not in 
conjunction with the coloring problem. 
Abstract -- In many applications, it is necessary 
to perform the computationally intensive task 
of extracting the roots of a high order real poly- 
nomial. Parallel approaches to the root-finding 
problem are summarized. A new SIMD (single in- 
struction stream - multiple data stream) algorithm 
is described. The algorithm is a parallel imple- 
mentation of Graeffe's method. It can employ a 
number of processors less than or equal to the de- 
gree of the polynomial. The p-processor algorithm 
achieves an O(p) speedup over the corresponding 
serial algorithm. This compares favorably with 
other iterative parallel root-finding algorithms, 
which have typically used fewer processors than 
the SIMD Graeffe's method, and which have exhibit- 
ed O(Cnumber of processors) speedup. 

1. Introduction 

In many applications, including digital signal 
processing and automatic control, it becomes 
necessary to extract the (possibly complex) roots 
of a high order real polynomial equation. Polynomials 
of degree 10 to 25 are not uncommon; polynomials 
with degree as high as 100 are sometimes encountered. 
In this paper, the application of 
parallel processing to the root-finding problem is 
examined. Proposed techniques are summarized, and 
a new parallel algorithm is described. 
Since the conventional techniques of root-finding 
usually involve variable length iterations 
and repetitive root extraction, they generally do 
not map immediately to the parallel domain. The 
main concerns thus become: can the problem be 
fairly partitioned among a large enough number of 
processors to gain a reasonable speed-up, and can 
the interprocessor communications be sufficiently 
minimized? In addition, methods that have been 
discarded for serial computation need to be recon- 
sidered for parallel computation if they are easi- 
ly partitionable. 

For a given application, a number of properties 
of the root-finding method must also be taken into 
consideration. These include the following: Can 
the method extract only real or both real and com- 
plex roots? Can the method handle multiple roots 
at the same location? Does the method encounter 
problems of numerical stability or overflow under 
some conditions? Does the method require a good 
initial estimate of a root's location in order to 
converge? 

In Section 2, approaches to using parallelism 
to extract the roots of a polynomial are discussed. 
In Section 3, a specific parallel algorithm 
is presented. The particular root-finding 
method described is one which has the required 
properties for both parallel implementation and 
scientific (in particular, signal processing) application. 
The attributes required in a parallel 
machine to implement the algorithm and the computational 
characteristics of the parallel algorithm 
are discussed. 
2. Approaches to Parallel Root-Finding 

There are two principal ways in which root ex- 
traction is performed. The first of these is 
domain division. This consists of partitioning 
the domain over which roots may occur and then 
searching for roots in the individual subdomains. 
For example, after the domain is partitioned, 
Muller's method [6] could then be used to find the 
roots in each subdomain. Some parallel methods of 
solving partial differential equations (e.g., [4]) 
may be applicable to parallel root-finding algo- 
rithms based on domain division. 

The second principal method of root-finding is 
the iterative approach, in which successive ap- 
proximations to the roots are obtained. Examples 
of this approach are illustrated in [5, 7, 8]. 
Parallelism can be applied to such algorithms either 
(1) to attempt to reduce the number of iterations 
performed or (2) to reduce the execution 
time per iteration. Parallel methods by Miranker 
[9, 10], Feldstein and Firestone [10], Shedler 
[11], and Winograd [15] have attempted to reduce 
the number of iterations in such methods as 
Lagrange extrapolation [10], Hermite interpolation 
[10], and Newton-Raphson [11]. At each step in 
the iteration for finding a given root, p processors 
obtain p different, independent approximations 
for the root. From these, the best approximation 
to use in the next step is derived. In 
such methods, it has been shown that the number of 
iteration steps is reduced by a factor of log p 
when p processors are used; values of p considered 
(e.g., in [11]) have typically been small. 


3. A Highly Parallel Root-Finding 

In this section, an iterative root-finding al- 
gorithm is presented in which parallelism is used 
to reduce the execution time per iteration. The 
algorithm is a parallel implementation of 
Graeffe's method [6]. This method is not commonly 
used with serial processors since it is slow in 
comparison with other serial algorithms. The 
method can find both real and complex roots and 
can be adapted to find roots with multiplicity 
greater than one. 


Graeffe's Method 


Graeffe's method is based upon forming a se- 
quence of equations whose zeros are the squares of 
the zeros of the previous equation in the se- 
quence. This is done to separate the roots in the 
equation so that they can be obtained by solving a 
set of linear equations. For example, consider the 
polynomial equation 


$p(x) = x^3 + a_1 x^2 + a_2 x + a_3 = (x - z_1)(x - z_2)(x - z_3) = 0.$ 

If the magnitude of $z_1$ is much larger than the 
magnitude of $z_2$, which is in turn much larger than 
the magnitude of $z_3$, then p(x) is approximately 

$x^3 - z_1 x^2 + z_1 z_2 x - z_1 z_2 z_3 = 0.$ 

Thus, $a_1 \approx -z_1$, $a_2 \approx z_1 z_2 \approx -a_1 z_2$, and $a_3 = -z_1 z_2 z_3 \approx -a_2 z_3$. 

The root squaring process is based on forming 
the product $p(x)p(-x)(-1)^n$, where n is the degree 
of the original equation. This results in a 
polynomial of a degree twice that of p(x), but 
with only even powers of x. If each $x^2$ is replaced 
by x, the result is an equation of the same degree 
as p(x), but with roots that are the squares of 
the roots of p(x). This procedure is repeated until 
each coefficient in an equation is the square 
of the corresponding coefficient in the previous 
equation, within a desired tolerance. 

Assume that k squarings (iterations) are needed 
to satisfy the tolerance requirement, and let 
$m = 2^k$. The final equation can be solved to give 
the magnitudes of the m-th powers of the roots. 
Substitution is used to find the actual roots. By 
examining the form of the final few equations in 
the sequence, one can determine the types of roots 
(real or complex) that are in the equation. In 
general, if $z_j$ and $z_{j+1}$ are a complex conjugate 
root pair, then the coefficient of $x^{n-j}$ will oscillate. 
Once this oscillation meets certain 
tolerance requirements, the magnitude of the roots 
and the cosine of m times the phase angle can be 
determined. The actual angle must be determined by 
trial. 

The principal computation in Graeffe's method 
is in the repeated evaluation of $p(x)p(-x)(-1)^n$. 
Let the current equation be given by 

$B_0 x^n + B_1 x^{n-1} + \cdots + B_{n-1} x + B_n$ (1) 

and let the next equation in the sequence (after 
replacing $x^2$ by x) be given by 

$C_0 x^n + C_1 x^{n-1} + \cdots + C_{n-1} x + C_n.$ (2) 

It has been shown [6] that the j-th coefficient, 
$C_j$, can be evaluated from the previous set of 


coefficients by the equation 
For each $C_j$, the evaluation stops when the subscripts 
on the required B's fall outside the range 
of the coefficient set. 
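The coefficient recurrence can be sketched serially as follows. This is a minimal illustration, not the authors' code: the function name `graeffe_step` and the list layout (`b[j]` holds the coefficient of x^(n-j)) are assumptions, and the signs follow the accumulation used in the procedure of Fig. 1, so the output carries the magnitudes of the coefficients of the root-squared polynomial.

```python
def graeffe_step(b):
    """One root-squaring step (hypothetical helper).

    b[j] is assumed to hold B_j, the coefficient of x**(n-j).
    Returns the list of C_j = B_j^2 - 2 B_{j-1} B_{j+1} + 2 B_{j-2} B_{j+2} - ...
    """
    n = len(b) - 1
    c = []
    for j in range(n + 1):
        acc = b[j] * b[j]
        q = 1
        # Stop as soon as a subscript falls outside the coefficient set.
        while j - q >= 0 and j + q <= n:
            acc += 2 * (-1) ** q * b[j - q] * b[j + q]
            q += 1
        c.append(acc)
    return c


# (x-1)(x-2)(x-3) = x^3 - 6x^2 + 11x - 6; the squared roots are 1, 4, 9,
# whose elementary symmetric sums are 14, 49 and 36.
example = graeffe_step([1, -6, 11, -6])
```

For the example polynomial the squared roots 1, 4 and 9 give the sums 14, 49 and 36, which is what the step returns.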

Unlike most iterative root-finding methods, 
Graeffe's method has the advantage of computing 
all of the roots in parallel and of having a basi- 
cally parallel structure. This algorithm for 
evaluating one set of coefficients from the previ- 
ous set is the basis of the parallel approach. 




Based on the explanation in the previous sec- 
tion, the general algorithm to implement Graeffe's 
method will be of the form: 


begin(findroot)
   k = 0      /* k is iteration number */
   while (termination criteria not met)
   begin(loop)
      evaluate next set of coefficients
      k = k + 1
   end(loop)
   solve for roots
end.(findroot)


The heart of the algorithm is the finding of the 
new set of coefficients. Therefore, this part of 
the algorithm will be considered first. 

In the parallel implementation, each processor will compute one of the coefficients for the next equation in the iteration. Thus, for an n-th degree (n+1 coefficient) polynomial, p = n+1 processors can be used. (A smaller number of processors can be used with a corresponding increase in execution time.) During a given iteration, the same operations are performed to obtain each coefficient, but on different data, so SIMD (single instruction stream - multiple data stream) parallelism is indicated. The SIMD machine model used will consist of a control unit, interconnection network, and p PEs (processing elements), where each PE is a processor-memory pair [12]. The p PEs are numbered from 0 through p-1, with 0 denoting the rightmost PE and p-1 the leftmost PE. Each PE will initially contain one of the coefficients of the polynomial. It will be assumed that PE j, $0 \le j \le n$, contains the coefficient of $x^{n-j}$; at each iteration, PE j holds $B_j$, then $C_j$.


The procedure in Fig. 1 computes the new set of coefficients from the old set. In the algorithm, lcycle and rcycle denote the execution of inter-PE transfers. In lcycle, the value in PE j is transferred to the variable of the same name in PE j+1. The transfer occurs simultaneously for all j. The value that was in PE p-1 is lost, and a zero is shifted into the transfer variable in PE 0. Rcycle is similar.

Fig. 2 illustrates the data movement in procedure "coefficient." In each iteration, the coefficients can be found in parallel using a sequence of $\lfloor n/2 \rfloor + 1$ steps, in which each step consists of two transfers, three multiplications and one subtraction. The data transfers required are from each PE to its two nearest neighbors. The actual


procedure coefficient(old,new)
/* input: old coefficients in variable "old"
   output: new coefficients in variable "new"
   "old" in PE j is B(j) in eqn. (1)
   "new" in PE j is C(j) in eqn. (2)
*/
local variable a, l, r;

l = old;       /* B values obtained via left shift */
r = old;       /* B values obtained via right shift */
a = old**2;    /* will accumulate result */
for q = 0 until floor(n/2) do
begin(loop)
   lcycle(l);  /* left shift */
   rcycle(r);  /* right shift */
   a = a - 2 * (-1)**q * l * r;
               /* add next term in sequence */
end(loop)
new = a;
end.(coefficient)

Fig. 1. Procedure to compute the next set of coefficients.

Fig. 2. Variable movement in procedure "coefficient," for n = 6, p = 7.


steps required to perform an lcycle or rcycle transfer will depend on the hardware organization of the particular parallel machine. A transfer can be done in one pass through most interconnection networks, with appropriate masking. The time to perform a transfer will therefore be small. (Barnes and Lundstrom report a 120 ns connection time for a 10-stage multistage network [2]. In a system with single stage ring or nearest neighbor connections, transfer time could be expected to be even less.) Depending on the relative time to perform arithmetic and transfer operations, it may be possible to overlap the network transfers with the computations. In this case, the time incurred by the data transfers will be negligible. Overhead is also introduced by the fact that the $\lfloor n/2 \rfloor + 1$ steps are performed in each PE, even though all of the coefficients do not need this many steps. The lcycle and rcycle functions shift zeros into the "edge" PEs to nullify the effect of the extra multiplications and subtractions. These multiplications by zero are steps performed in the SIMD algorithm that are not executed in a serial algorithm. In one iteration, the (n+1)-PE SIMD algorithm will perform $3(\lfloor n/2 \rfloor + 1)$ multiplications and $\lfloor n/2 \rfloor + 1$ subtractions, compared to $3\lfloor n/2 \rfloor (n - \lfloor n/2 \rfloor)$ multiplications and $\lfloor n/2 \rfloor (n - \lfloor n/2 \rfloor)$ subtractions required in the serial algorithm. The speedup on arithmetic operations is therefore approximately p/2 for the p-PE algorithm.
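The lockstep behaviour of procedure "coefficient" can be imitated on an ordinary computer by treating a Python list as the row of PEs, with list index j standing for PE j. This is only a sketch under that assumption: the list-based `lcycle` and `rcycle` and the two counting helpers are illustrative stand-ins, not part of the paper's machine model.

```python
def lcycle(v):
    """Left shift: the value in PE j moves to PE j+1; a zero enters PE 0."""
    return [0] + v[:-1]


def rcycle(v):
    """Right shift: the value in PE j moves to PE j-1; a zero enters PE p-1."""
    return v[1:] + [0]


def coefficient(old):
    """Simulate Fig. 1: every list position plays the role of one PE."""
    n = len(old) - 1
    l, r = list(old), list(old)
    a = [x * x for x in old]
    for q in range(n // 2 + 1):          # floor(n/2) + 1 lockstep steps
        l, r = lcycle(l), rcycle(r)
        a = [a[j] - 2 * (-1) ** q * l[j] * r[j] for j in range(n + 1)]
    return a


def simd_multiplications(n):
    """Multiplications per PE in one iteration: 3 (floor(n/2) + 1)."""
    return 3 * (n // 2 + 1)


def serial_multiplications(n):
    """Multiplications in the serial algorithm: 3 floor(n/2) (n - floor(n/2))."""
    return 3 * (n // 2) * (n - n // 2)
```

The zeros shifted in at the edges make the out-of-range products vanish, which is why every PE can run the same number of steps. For n = 100 the ratio of the two counts is 7500/153, about 49, close to p/2 = 50.5.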

The next step to be considered in Graeffe's method is the determination of whether or not the termination criteria have been met. Fig. 3 details an algorithm for this step. There are two cases: non-oscillatory and oscillatory. For the first, the difference between the magnitude of the current coefficient and the square of the previous coefficient is compared with the termination tolerance. This is done in one comparison step, performed simultaneously in all PEs. As a result of this step, those PEs in which a real root has been located to sufficient accuracy can be identified. If the non-oscillatory tolerance has been met in all PEs, there is no need to test the criteria for oscillatory tolerance. In the algorithm in Fig. 3, this condition is tested in the "if all" statement, which has value 1 if all PEs satisfy the stated condition. The way in which the "if all" is performed will depend on the parallel architecture. If it can be accomplished efficiently, this capability to evaluate a condition across all PEs can, in some cases, eliminate the need to test the oscillating criteria. Such statements (if all, if any) are implemented in PEPE [14]. A possible implementation for the PASM multimicroprocessor system is described in [13].
For oscillatory termination, one approach to determining whether or not to terminate is to obtain the phase angle based on both the new and old sets of coefficients. If the phase angle is the same, within a specified tolerance, for both coefficient sets, the criterion can be considered met. If oscillating termination is to be tested, each PE obtains, via lcycle and rcycle transfers, the current and previous coefficients from its two nearest neighbors. The coefficients $(B_{j-1}, B_j, B_{j+1})$ and $(C_{j-1}, C_j, C_{j+1})$ are used to determine if oscillatory tolerance is met in PE j. In the algorithm, PEs in which oscillatory tolerance is to be tested are enabled (and PEs which met non-oscillatory tolerance are disabled) by means of a "where" statement. The "where" construct is a data conditional mask [1, 3] in which each PE evaluates the condition using its own data, and sets its active/inactive status so that it is active for the statements following the "where" only if the condition is true. PEs in which the condition is false are disabled for those statements. At each iteration, the test for oscillatory tolerance is performed simultaneously in all PEs which fail the non-oscillatory tolerance test.


After the coefficient sequences satisfy the termination criteria, all of the real roots can be found in one parallel step and all of the complex roots can be found in one parallel step. This is detailed in Fig. 4. The complete root-finding algorithm is given in Fig. 5.


procedure tolerance(old,new,n_osc,tol_ok,osc)
/* input:
      old = previous set of coefficients
      new = current set of coefficients
   output:
      tol_ok = bit vector indicating where
               tolerance was met
      n_osc = bit vector indicating where non-
              oscillatory tolerance was satisfied
      osc = bit vector indicating where
            oscillatory tolerance was satisfied
   Oscillations are detected using θ computed
   from old and new coefs:
      βrmold, βrmnew = β(r)**m
      2 β(r)**m cos(mθ) = (E(r) / E(r-1))
      β(r)**(2m) = (E(r+1) / E(r-1))
   where E(j) can be either B(j) or C(j),
   and if this is iteration k, m = 2**k;
   0 < r < p-1
*/
local variable lold, rold, lnew, rnew;
tol_ok = n_osc = osc = 0;
where (error(old**2,new) < tol_criterion1)
   tol_ok = n_osc = 1;
/* if the new coefficient is the square of the
   old one within a specified error, then the
   criterion is met. This check is done in
   parallel in all PEs */
if (not all (n_osc)) then
begin(oscillatory)
   lold = old;
   rold = old;
   rcycle(rold);   /* B(j+1) */
   lcycle(lold);   /* B(j-1) */
   rnew = new;
   lnew = new;
   rcycle(rnew);   /* C(j+1) */
   lcycle(lnew);   /* C(j-1) */
   βrmold = sqrt(rold/lold);
   βrmnew = sqrt(rnew/lnew);
   where (error(arccos(old/(lold * βrmold*2)),
                arccos(new/(lnew * βrmnew*2))/2.0)
          < tol_criterion2)
      tol_ok = osc = 1;
   /* Combine the two equations involving β to
      determine a possible θ for both of the last
      two coefficient sequences, and compare the
      two values of θ. This is done in parallel
      in all PEs that did not satisfy the squar-
      ing tolerance check */
end(oscillatory)
end.(tolerance)

Fig. 3. Procedure to determine if termination criteria have been met.


procedure findroot(new,n_osc,tol_ok,osc,z,mag,ang)
/* input:
      new = elements of coefficient array
      n_osc, tol_ok, osc -- as in procedure tolerance
   output:
      z -- location of zero if root is real
      mag, ang -- magnitude and angle if
                  root is complex
*/
local variable l, r;
tol_ok = .not.(osc .or. rcycle(osc))
/* PEs not involved with oscillatory cases
   (right shift is due to the fact that a
   complex root affects two adjacent coeffi-
   cients) */
z = mag = ang = INVALID;   /* no roots yet */
l = new;
r = new;
lcycle(l);                 /* new(j-1) */
rcycle(r);                 /* new(j+1) */
where (n_osc .and. tol_ok)
begin(non-oscillatory)
   /* all real roots are found in parallel */
   z = (new/l) ** (1/(2**k));
   /* (new/l)**(1/m) */
   where (abs_value(p(z)) > tol_criterion3)
      z = -z;              /* test by evaluating p(z) */
end(non-oscillatory)
where (osc)
begin(oscillatory)
   /* all complex roots are found in parallel */
   mag = (r/l)**(1/(2*(2**k)));
   /* magnitude of complex pair */
   ang = arccos(new/(2*l*sqrt(r/l)))/(2**k);
   /* possible angle */
end(oscillatory)
end.(findroot)

Fig. 4. Procedure to compute the roots from the final set of coefficients.


program parallel_root(new)
/* input: new = coefficients of p(x)
   output: roots
*/
k = 0;
repeat
   old = new;
   coefficient(old,new);
   tolerance(old,new,n_osc,tol_ok,osc);
   k = k + 1
until (all(tol_ok));
findroot(new,n_osc,tol_ok,osc,z,mag,angle);
end.(parallel_root)

Fig. 5. Program to perform parallel Graeffe's method.
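The overall control flow of Fig. 5 can be illustrated with a short serial sketch restricted to the non-oscillatory case, that is, real roots of distinct magnitude. The names `graeffe_step` and `real_root_magnitudes`, the relative-error form of the tolerance test and the iteration cap are all assumptions made for this illustration; complex pairs and the sign test by evaluating p(z) are deliberately left out.

```python
def graeffe_step(b):
    """C_j = B_j^2 - 2 B_{j-1} B_{j+1} + 2 B_{j-2} B_{j+2} - ... (b[j] multiplies x**(n-j))."""
    n = len(b) - 1
    c = []
    for j in range(n + 1):
        acc = b[j] * b[j]
        q = 1
        while j - q >= 0 and j + q <= n:
            acc += 2 * (-1) ** q * b[j - q] * b[j + q]
            q += 1
        c.append(acc)
    return c


def real_root_magnitudes(coeffs, tol=1e-9, max_iter=8):
    """Repeat root squaring until each coefficient is the square of its
    predecessor (relative tolerance tol), then recover root magnitudes.

    max_iter is kept small because the coefficients grow like the
    2**k-th power of the roots and would overflow floating point.
    """
    new = [float(c) for c in coeffs]
    k = 0
    while k < max_iter:
        old = new
        new = graeffe_step(old)
        k += 1
        if all(abs(c - b * b) <= tol * abs(c) for b, c in zip(old, new)):
            break
    m = 2 ** k
    # Ratio of adjacent coefficients approximates |z_j|**m.
    return [abs(new[j] / new[j - 1]) ** (1.0 / m) for j in range(1, len(new))]


# p(x) = (x-1)(x-2)(x-3); magnitudes come out largest first.
magnitudes = real_root_magnitudes([1, -6, 11, -6])
```

For this example the loop stops after seven squarings (m = 128), and the recovered magnitudes are close to 3, 2 and 1.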


4. Conclusions 

Although exact comparisons between this parallel method and serial methods are difficult due to the data-dependent, iterative nature of the algorithms, some general comparisons can be made between the parallel and serial versions of Graeffe's method. First, the number of iteration steps performed will be the same as in the serial algorithm. However, within each step, the coefficients for the next equation in the sequence are computed in parallel rather than serially. The computational speedup will be approximately p/2 for computing the coefficient sequences. The less than ideal speedup is due to redundant operations required for the parallel algorithm. Computational speedup in evaluating the termination conditions will depend on the relative number of real and complex roots, and will range from approximately 0.2p to 0.9p. Speedup on computations for finding the roots from the final set of coefficients can be up to p/2, depending on the relative number of real and complex roots. The execution time will be clearly dominated by the computation of the coefficient sequences. The interprocessor communications required in the algorithm are from each PE to its two nearest neighbors. The ratio of arithmetic steps to transfer steps in each iteration is approximately 5:1. However, unless transfers are significantly slower than arithmetic operations, it will be possible to overlap the time spent in inter-PE communications with the time spent performing computations. Overall speedup will therefore be dictated by the speedup on computations, and will be on the order of p/2. This compares favorably with other iterative parallel root-finding algorithms, for which implementations using p processors have exhibited O(log p) speedup. For these methods, values of p considered have typically been smaller than the number of processors used by the SIMD Graeffe's method presented here.
In summary, a new parallel algorithm to perform root-finding has been developed. The approach taken in the algorithm differs significantly from that of other parallel root-finding methods.


OPTIMIZING THE FACR($\ell$) POISSON-SOLVER
ON PARALLEL COMPUTERS

Abstract-- A two parameter description of any computer is given that characterises the performance of serial, pipelined and array-like architectures. The first parameter ($r_\infty$) is the traditional maximum performance in megaflops, and the new second parameter ($n_{1/2}$) measures the apparent parallelism of the computer. The relative performance of two algorithms on the same computer depends only on $n_{1/2}$ and the average vector length of the algorithm. The performance of a family of FACR direct methods for solving Poisson's equation is optimized on the basis of this characterisation.


Parallel Computers 


A two-parameter description of the performance of any computer can be obtained by fitting the best straight line to the measured time, t, to perform a single vector operation on vectors of varying length, n (e.g. A=B*C, where A, B and C are vectors). A similar description of computer performance has been developed by Calahan, Ames and their coworkers at the University of Michigan (see [2] and the references therein). Our work below differs in the definition of parameters and the use made of them. Two equivalent generic forms for the straight line define two primary and one useful secondary derived parameter:

$$t = r_\infty^{-1}(n + n_{1/2}) \qquad (1)$$

where

$r_\infty$: (maximum or asymptotic performance) the maximum number of elemental arithmetic operations (i.e. operations between pairs of numbers) per second, usually measured in megaflops. This occurs for infinite vector length on the generic computer.

$n_{1/2}$: (half-performance length) the vector length required to achieve half the maximum performance.

0190-3918/82/0000/0062$00.75 © 1982 IEEE

Alternatively, when $n < n_{1/2}$, the generic line may be more usefully expressed as:

$$t = \pi_0^{-1}(1 + n/n_{1/2}) \qquad (2)$$

where

$\pi_0$: (specific performance) or performance per unit parallelism, is defined as the ratio $r_\infty / n_{1/2}$.


The above definitions are shown graphically in Fig. 1 where we find:

$r_\infty$ is the inverse slope of the generic line
$n_{1/2}$ is its negative intercept on the n-axis
$\pi_0$ is its inverse intercept on the t-axis

Fig. 1. The timing diagram for the generic parallel computer, showing the definitions of the parameters $r_\infty$, $n_{1/2}$ and $\pi_0$. (From Hockney and Jesshope 1981, courtesy of Adam Hilger).
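The straight-line fit that defines the three parameters can be sketched with an ordinary least-squares line through (n, t) samples. The function names below and the synthetic timing data are assumptions chosen for illustration; real measurements would come from timing a vector operation at each length.

```python
def vector_time(n, r_inf, n_half):
    """Generic timing line of eqn. (1): t = (n + n_half) / r_inf."""
    return (n + n_half) / r_inf


def fit_timing_line(lengths, times):
    """Least-squares line t = slope * n + intercept through the samples.

    Returns (r_inf, n_half, pi_0), read off as in Fig. 1:
    inverse slope, negative n-axis intercept, inverse t-axis intercept.
    """
    count = len(lengths)
    mean_n = sum(lengths) / count
    mean_t = sum(times) / count
    sxx = sum((n - mean_n) ** 2 for n in lengths)
    sxy = sum((n - mean_n) * (t - mean_t) for n, t in zip(lengths, times))
    slope = sxy / sxx
    intercept = mean_t - slope * mean_n
    r_inf = 1.0 / slope
    n_half = intercept / slope
    pi_0 = float("inf") if intercept == 0 else 1.0 / intercept
    return r_inf, n_half, pi_0


# Synthetic, noise-free samples for a made-up machine with
# r_inf = 80 operations per unit time and n_half = 20.
lengths = list(range(1, 101))
times = [vector_time(n, 80.0, 20.0) for n in lengths]
r_inf, n_half, pi_0 = fit_timing_line(lengths, times)
```

With noise-free data the fit returns the generating parameters, and at n equal to n_half the achieved rate n/t is half of r_inf, which is the defining property of the half-performance length.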


It is useful to examine the values of $r_\infty$ and $n_{1/2}$ that are expected from the common forms of computer architecture. This is done by considering the timing line for each type:

(a) Serial Computer - the execution time is proportional to the number of elemental operations

$$t = t_1 n \qquad (3)$$

where $t_1$ is the time for one elemental operation. Comparison with Eqn. (1) shows that for a serial computer

$$r_\infty = t_1^{-1}, \quad n_{1/2} = 0 \qquad (4)$$

(b) Pipelined Computer - the execution time is normally expressed by the manufacturers in a form similar to

$$t = (s + \ell + n - 1)\tau \qquad (5)$$

where

$\tau$ is the clock period
s is the startup time in clock periods
$\ell$ is the number of segments in the arithmetic pipeline

Comparison with Eqn. (1) shows that for a pipelined computer

$$r_\infty = \tau^{-1}, \quad n_{1/2} = s + \ell - 1 \qquad (6)$$

(c) Processor Array - if there are N processors which simultaneously perform the same arithmetic operation on N elements of each vector (one element of each vector in each processor's memory), then the timing graph is stepwise as shown in Fig. 2

$$t = t_\parallel \lceil n/N \rceil \qquad (7)$$

where

$\lceil x \rceil$ is the ceiling function of x, i.e. the smallest integer which is equal to or greater than x.
$t_\parallel$ is the time for one parallel arithmetic operation of all processors in the array.

The best straight line through the timing graph is the dotted line which corresponds to

$$r_\infty = N/t_\parallel, \quad n_{1/2} = N/2 \qquad (8)$$

This choice of parameters describes approximately the average behaviour of the array if the vectors presented to it are of varying lengths, more or less uniformly distributed.

On the other hand one may know that the vector length is always less than the number of processors ($n \le N$) and that therefore one is always working on the first step of the timing graph. In this case the behaviour is exactly described by the second generic form with

$$\pi_0 = t_\parallel^{-1}, \quad r_\infty = n_{1/2} = \infty \qquad (9)$$

We note that this condition is the one assumed in the complexity theory of parallel algorithms: that is to say that there are always enough processors. This can occur in general for the theoretical paracomputer which has an infinite number of processors. It is nice that in our formalism this theoretical limit occurs when $n_{1/2} = \infty$.


Fig. 2. The timing diagram for an array of N processing elements (solid line), showing the best approximating generic straight line (dotted) which determines the value of $n_{1/2}$ as N/2. The axes are $t/t_\parallel$ against n/N. (From Hockney and Jesshope 1981, courtesy of Adam Hilger).
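The three architectural cases can be collected into small helper functions that return the pair ($r_\infty$, $n_{1/2}$) for each timing model. The function names and the numerical values in the usage lines are hypothetical and chosen only so that the arithmetic is easy to check by hand.

```python
import math


def serial_params(t1):
    """Serial computer, eqns. (3)-(4): r_inf = 1/t1, n_half = 0."""
    return 1.0 / t1, 0.0


def pipeline_params(tau, s, l):
    """Pipelined computer, eqns. (5)-(6): r_inf = 1/tau, n_half = s + l - 1."""
    return 1.0 / tau, s + l - 1


def array_params(t_par, N):
    """Processor array, eqn. (8): r_inf = N/t_par, n_half = N/2."""
    return N / t_par, N / 2


def pipeline_time(n, tau, s, l):
    """Eqn. (5): t = (s + l + n - 1) * tau."""
    return (s + l + n - 1) * tau


def array_time(n, t_par, N):
    """Stepwise timing of eqn. (7): t = t_par * ceil(n / N)."""
    return t_par * math.ceil(n / N)


# Made-up pipeline: clock period 0.5, startup 3 periods, 4 segments.
r_inf, n_half = pipeline_params(0.5, 3, 4)
generic = (10 + n_half) / r_inf          # eqn. (1) at n = 10
direct = pipeline_time(10, 0.5, 3, 4)    # eqn. (5) at n = 10
```

For the pipeline the generic line reproduces the manufacturer's formula exactly, whereas for the array the line with $n_{1/2}$ = N/2 is only the best straight-line approximation to the staircase.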


The above theoretical results for a range of widely different computer architectures suggest that $n_{1/2}$ is a measure of the parallelism of the computer hardware, varying from zero for a serial computer with no parallelism to infinity for the infinite array of processors. The exception is the pipelined computer in which a large value of $n_{1/2}$ can occur either for a large amount of parallel operation in the pipeline (the number of segments $\ell$ is large), or for a large value of the setup time, s. In the former case $n_{1/2}$ is measuring the hardware parallelism, but in the latter case it is measuring an overhead. From the user's, or algorithmic, point of view the behaviour of the computer is determined by the timing expression (1) and the value of $n_{1/2}$, however it arises. A pipelined computer with a large value of $n_{1/2}$ appears and behaves as though it has a high level of parallelism, even though this might be due to a long setup time. Hence we regard $n_{1/2}$ as a measure of the apparent parallelism of the computer, and from the algorithmic (i.e. timing) point of view it simply does not matter how much of this is real. The fact that true parallelism and setup time are interchangeable, incidentally, shows that parallelism is an overhead, and therefore undesirable (by which we mean that parallelism is best avoided if at all possible, or that one should always seek to achieve the required performance with the least possible parallelism).


The values of $n_{1/2}$ and $r_\infty$ of a computer are best regarded as measured quantities obtained by executing the following FORTRAN code and plotting the timing graph of T against N:

      CALL SECOND(T1)
      CALL SECOND(T2)
      T0 = T2-T1

      DO 20 N = 1,NMAX                    (10)
      CALL SECOND(T1)
      DO 10 I = 1,N
   10 A(I) = B(I) * C(I)
      CALL SECOND(T2)
   20 T = T2 - T1 - T0

In the above code, the DO 10 loop will be replaced by a single vector instruction by any vectorizing compiler. The measurement and subtraction of the timing overhead T0 is essential because, as we have seen, any overhead will appear as a contribution to $n_{1/2}$. In this case the overhead of measurement is nothing to do with the time of the vector operation and must therefore be subtracted.

The characterisation of the performance of computers by two parameters naturally leads to plotting computers as points in the two-dimensional ($n_{1/2}$, $r_\infty$) phase plane, as is done for some well known designs in Fig. 3. In practice most computers may operate in different modes (scalar or vector, dyadic or triadic operations, different word lengths etc.) and therefore appear as a series of dots, joined to form a "constellation" in the diagram. The traditional characterisation of computer performance by the single parameter $r_\infty$ corresponds to projecting this diagram onto, or viewing it through, the vertical axis. In the era of serial computers, all of which have the same $n_{1/2}$ of zero, this was clearly valid. However, in the age of the parallel computer, it is obviously important to recognise the different levels of apparent parallelism by spreading the computers out along the $n_{1/2}$ axis. We call this the two-dimensional spectrum of computers. We shall see in the next section that $n_{1/2}$ determines the choice of the best algorithm, and hence is a very important axis. As examples, Fig. 3 shows that the CRAY-1 ($n_{1/2} \approx 10$) and the CYBER 205 ($n_{1/2} \approx 100$) have similar values of $r_\infty$, but behave very differently because their values of $n_{1/2}$ differ by a factor 10. For the same reason, the ICL DAP ($n_{1/2} \approx 1000$) differs from both the CRAY-1 and CYBER 205.
Fig. 3. The two-dimensional spectrum of computers, showing the CRAY-1, CYBER 205 and ICL DAP. (After Hockney and Jesshope 1981, courtesy of Adam Hilger).


Parallel Algorithms 


To a first approximation an algorithm can be regarded as a sequence of vector operations of varying length (including one). Such a representation, of course, neglects many factors that may be important (even crucial) in particular cases. Such factors may be, for example, memory bank conflicts in pipelined computers, data routing delays in processor arrays, and the simultaneous operation of scalar and vector units. However we have to start somewhere and avoid too much complication if we are to obtain manageable results. Therefore, in common with other theoretical analyses of algorithm performance, we shall assume such factors are unimportant and express the total time, T, for the execution of an algorithm as

$$T = r_\infty^{-1} \sum_{\ell=1}^{\ell_{max}} q_\ell (p_\ell + n_{1/2}) \qquad (11)$$

where we regard the algorithm as $\ell_{max}$ sequential stages, $\ell$, each composed of $q_\ell$ vector operations of length $p_\ell$. The generic timing formula (1) is then used to build up the expression (11).


It is useful to define the following quantities:

$q = \sum_{\ell=1}^{\ell_{max}} q_\ell$: the total number of vector operations, the parallel operations' count or, in the language of complexity theory, the number of unit timesteps.

$s = \sum_{\ell=1}^{\ell_{max}} q_\ell p_\ell$: the number of elemental operations, or the traditional serial (scalar) operations' count.

$\bar{p} = s/q$: the average vector length, or average parallelism of the algorithm.
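These counts are easy to tabulate for a concrete stage list. The sketch below assumes an algorithm is described as a list of (q, p) pairs, one per stage; the representation, the function names and the sample numbers are illustrative choices, not taken from the paper.

```python
def operation_counts(stages):
    """stages is a list of (q_l, p_l): q_l vector operations of length p_l.

    Returns (q, s, p_bar): parallel operations' count, elemental
    operations' count, and average vector length s/q.
    """
    q = sum(q_l for q_l, _ in stages)
    s = sum(q_l * p_l for q_l, p_l in stages)
    return q, s, s / q


def execution_time(stages, r_inf, n_half):
    """Stage-by-stage total of eqn. (11)."""
    return sum(q_l * (p_l + n_half) for q_l, p_l in stages) / r_inf


# Made-up algorithm: 2 operations of length 64, 4 of length 16, 1 scalar.
stages = [(2, 64), (4, 16), (1, 1)]
q, s, p_bar = operation_counts(stages)
T = execution_time(stages, r_inf=10.0, n_half=8)
```

Summing stage by stage gives the same total as combining the two counts as (s + n_half * q) / r_inf, since the sum in eqn. (11) splits into the elemental count and n_half times the vector-operation count.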


Using these variables the time of execution of an algorithm can be expressed either as

$$T = r_\infty^{-1} q (\bar{p} + n_{1/2}) \qquad (12)$$

where the algorithm is regarded as q sequential vector operations with average vector length $\bar{p}$, or as

$$T = r_\infty^{-1} (s + n_{1/2} q) \qquad (13)$$

where the first term is the contribution from the traditional count of all elemental arithmetic operations, and the second term is the contribution from the number of parallel (i.e. vector) operations. Equation (13) demonstrates clearly the role of $n_{1/2}$ in interpolating between the extremes of the serial computer ($n_{1/2} = 0$) and the infinitely parallel computer ($n_{1/2} = \infty$). For serial computers only the first term or elemental operations' count matters. For the infinitely parallel computer only the second term or the number of parallel operations matters. For computers with finite parallelism, a linear combination of the two operations' counts is appropriate, and the value of $n_{1/2}$ gives the weighting between the two. Since $n_{1/2} = \infty$ corresponds to the assumptions made in the complexity analysis of parallel algorithms and q is the number of unit timesteps in such an analysis, equation (13) shows also how $n_{1/2}$ interpolates rationally between the extreme assumptions that are used in complexity analysis and those that have traditionally been used in the analysis of algorithms on serial computers.

It is instructive to relate the quantities defined above to those introduced by Kuck [3] for the analysis of parallel algorithms. The most important of these is SPEEDUP which relates the speed of an algorithm on a parallel multiprocessor array to the speed of the same algorithm on a serial uniprocessor with the same speed arithmetic units. Thus

$$\mathrm{SPEEDUP} = \frac{\text{time of execution on uniprocessor}}{\text{time of execution on multiprocessor}} = \frac{\text{number of elemental operations}}{\text{number of parallel operations}} = \frac{s}{q} = \bar{p} \qquad (14)$$

That is to say the SPEEDUP is nothing other than the average vector length (or parallelism) of the algorithm.

The use of the SPEEDUP factor as a figure of merit for parallel algorithms can be misleading because it is only one of several factors that must be considered in any comparison between a real parallel multiprocessor array and a real serial uniprocessor. Let us define the performance (or speed), P, of an algorithm as the inverse of its time of execution, that is to say $T^{-1}$, the number of executions of the algorithm that are possible per second. Then the relative performance is given by

$$\frac{P_p}{P_s} = \frac{T_s}{T_p} = \frac{s_s \times t_s}{q_p \times t_p} \qquad (15)$$

where the subscripts s and p refer to the serial uniprocessor and parallel multiprocessor respectively, and $t_s$ and $t_p$ are respectively the time for a serial and a parallel operation. Equation (15) can be expressed as

$$\frac{P_p}{P_s} = \frac{s_p}{q_p} \times \frac{s_s}{s_p} \times \frac{t_s}{t_p} \qquad (16)$$

= SPEEDUP x algorithmic SLOWDOWN x hardware SLOWDOWN
The first factor in equation (16) is the SPEEDUP factor previously defined, however the second and third factors are SLOWDOWN factors. In order for the parallel multiprocessor to outperform the serial uniprocessor, it is necessary that the product of the SPEEDUP and the SLOWDOWN factors be greater than one. It is not sufficient that the SPEEDUP factor alone be greater than one. The first SLOWDOWN factor, the algorithmic SLOWDOWN, arises because the definition of SPEEDUP assumes that the parallel algorithm is executed on the serial uniprocessor with an elemental operations' count of $s_p$. Almost certainly an algorithm chosen for a parallel computer will not be the best on a serial computer, and the number of elemental operations in the best serial algorithm, $s_s$, will almost certainly be less than $s_p$. Hence the algorithmic SLOWDOWN factor $s_s/s_p < 1$ (typically 1/5).

The second SLOWDOWN factor, the hardware SLOWDOWN, expresses the fact that if the multiprocessor and uniprocessor consume comparable resources, either in money, in number of chips, or in square millimetres of silicon, then the time to perform a serial operation on the uniprocessor, $t_s$, will be much less than the time to perform a parallel operation on the multiprocessor, $t_p$. In other words if you build many thousands of processors, each of them is going to be very slow compared with the speed of a single processor built or purchased with the same resources. Hardware SLOWDOWN factors are likely to be very small ($\approx 10^{-3}$ to $10^{-4}$). To take an extreme example, the CRAY-1 acts like a serial uniprocessor (small $n_{1/2} \approx 10$) and can produce an arithmetic result every 12.5 ns ($= t_s$). On the other hand, the ICL DAP is a parallel array of 4096 processors and performs a parallel operation in about 250 µs ($= t_p$). For these two computers the hardware SLOWDOWN is about 1/20,000. Taking the two example SLOWDOWN factors, we see that the SPEEDUP might have to exceed 100,000 before the parallel multiprocessor is likely to outperform the serial uniprocessor.
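The arithmetic behind the two example SLOWDOWN factors can be checked directly. In the sketch below only the timing figures (12.5 ns, 250 µs) and the typical algorithmic SLOWDOWN of 1/5 come from the text; the function names are illustrative.

```python
def relative_performance(speedup, algorithmic_slowdown, hardware_slowdown):
    """Product of the three factors of eqn. (16)."""
    return speedup * algorithmic_slowdown * hardware_slowdown


def break_even_speedup(algorithmic_slowdown, hardware_slowdown):
    """SPEEDUP at which the multiprocessor just matches the uniprocessor."""
    return 1.0 / (algorithmic_slowdown * hardware_slowdown)


t_s = 12.5e-9    # CRAY-1: one arithmetic result every 12.5 ns
t_p = 250e-6     # ICL DAP: one parallel operation in about 250 microseconds

hardware_slowdown = t_s / t_p      # about 1/20,000
algorithmic_slowdown = 1.0 / 5.0   # the typical value quoted in the text

needed = break_even_speedup(algorithmic_slowdown, hardware_slowdown)
```

The break-even value comes out at about 100,000, the figure quoted above; any SPEEDUP below it leaves the relative performance of eqn. (16) under one.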

Traditional methods for comparing the performance of algorithms are based either on the assumption that the computers are serial, when we compare the elemental operations' count, s; or on the assumption that the computers are array-like and always with sufficient processors, when we compare the parallel operations' count, q. We prefer to use the more general timing expression (12) or (13) and obtain a performance comparison for computers with finite values of $n_{1/2}$. Suppose we compare the performance of algorithm (a) on computer (1) with algorithm (b) on computer (2), then

$$\frac{P^{(a,1)}}{P^{(b,2)}} = \frac{s^{(b)} + n_{1/2}^{(2)} q^{(b)}}{s^{(a)} + n_{1/2}^{(1)} q^{(a)}} \times \frac{r_\infty^{(1)}}{r_\infty^{(2)}} \times \frac{C^{(2)}}{C^{(1)}} \qquad (17)$$

In the above, superscripts are used to distinguish the computer or algorithm; and we note that the algorithm is specified by the value of s and q (or $\bar{p}$ and q) and the computer is specified by values of $n_{1/2}$ and $r_\infty$ (or $n_{1/2}$ and $\pi_0$). The first two factors in Eqn. (17) come from the timing expression (13) and the last factor may be added if the cost of computer time is a relevant factor. C denotes the cost per unit computer time.

Equation (17) is general and compares the cost performance of different algorithms on different computers. If, however, we limit consideration to the choice of the better algorithm (in the sense of having the higher performance) on a particular computer, then Eqn. (17) reduces to

$$\frac{P^{(a)}}{P^{(b)}} = \frac{s^{(b)} + n_{1/2} q^{(b)}}{s^{(a)} + n_{1/2} q^{(a)}} \qquad (18)$$

in which we note that the second and third factors in Eqn. (17) reduce to unity, and that the choice of the better algorithm depends only on the $n_{1/2}$ of the computer and the s and q operations' counts of the algorithms.

In the comparison of algorithms, the equal performance line along which $P^{(a)} = P^{(b)}$ plays a key role because it divides regions of phase planes in which algorithm (a) has the better performance from regions in which algorithm (b) has the better performance. Along the equal performance line we have

$$n_{1/2} = \frac{s^{(a)} - s^{(b)}}{q^{(b)} - q^{(a)}} \qquad (19)$$

the left-hand side of which depends only on the computer and the right-hand side only on the algorithm. In general the operations' counts s and q are non-linear functions of some quantity measuring the size of the problem being solved: for example the dimension, n, of the matrices in a matrix problem. The equal performance line (19) can then easily be drawn on the ($n_{1/2}$, n) phase plane, because $n_{1/2}$ is always an explicit function of n, albeit a non-linear one. The phase plane can thereby be divided into regions in which each algorithm has the better performance. Sometimes it may be desirable from the graphical point of view to scale the axes and plot, for example, the ($n_{1/2}/n$, n) or ($n_{1/2}/n^2$, n) phase plane. It is a useful convention, however, always to choose the x-axis proportional to $n_{1/2}$, the apparent parallelism of the computer. In this way serial computer algorithms always appear to the left of the diagram, and parallel computer algorithms to the right.
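The equal performance condition is simple enough to evaluate directly for two candidate algorithms. The sketch below is illustrative only: the function names and the operation counts of the two made-up algorithms are assumptions, with algorithm (a) doing more arithmetic in fewer vector operations than algorithm (b).

```python
def algorithm_time(s, q, r_inf, n_half):
    """Timing expression (13): T = (s + n_half * q) / r_inf."""
    return (s + n_half * q) / r_inf


def equal_performance_n_half(s_a, q_a, s_b, q_b):
    """Eqn. (19): the n_half at which algorithms (a) and (b) take equal time."""
    return (s_a - s_b) / (q_b - q_a)


def better_algorithm(s_a, q_a, s_b, q_b, n_half):
    """Return 'a' or 'b' for a given machine; r_inf cancels on one computer."""
    t_a = algorithm_time(s_a, q_a, 1.0, n_half)
    t_b = algorithm_time(s_b, q_b, 1.0, n_half)
    return "a" if t_a < t_b else "b"


# Made-up counts: (a) s = 1000, q = 10; (b) s = 400, q = 40.
crossover = equal_performance_n_half(1000, 10, 400, 40)
```

Here the crossover falls at $n_{1/2}$ = 20: machines to the left of it (more serial) favour algorithm (b) with its lower elemental count, and machines to the right favour algorithm (a) with its lower vector-operation count.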


Poisson's Equation 


In this section we apply the method 
of analysis developed in section III to 
the selection of the best member of a fam- 
ily of direct methods for the solution of 
the model Poisson problem. The problem is 
the solution of the 5-point difference 
approximation to Poisson's equation on a 
square nxn finite difference mesh with 
simple boundary conditions (either given 
value, gradient or periodicity). Such a 
problem may seem artificially simple and 
of little practical importance; however,
history has shown that there are many
important problems in physics (plasma, 
astro-, and dense matter), electrical 
engineering (semiconductor device simula- 
tion) and meteorology that require espe- 
cially rapid methods for solving this 
problem (see, for example, Potter [4]; 
Hockney and Eastwood [5]). 

The method to be analysed is direct, and involves the optimum
combination of Fourier analysis in the x-direction and block cyclic
reduction by lines in the y-direction. The method is known as the
FACR($\ell$) algorithm, where $\ell$ is the number of stages of line
cyclic reduction that are performed before Fourier analysis takes
place. It represents a family of algorithms because the parameter
$\ell$ can be used to minimise the time of execution. The first
algorithm in this family, FACR(1), was published in 1965 by Hockney
[6] working in collaboration with Golub. Subsequently the optimum
value of $\ell$ ($\approx \log_2 \log_2 n$) for serial computers was
discovered empirically by Hockney [7], and the asymptotic form given
later by Swarztrauber [8].

On parallel computers, it is interesting that the optimum value of
$\ell$ depends not only on the size of the problem, n, as it does on
a serial computer, but also on the parallelism of the computer as
measured by its half-performance length, $n_{1/2}$. Hockney and
Jesshope [1] have given the analysis for one way of implementing the
FACR algorithm on a parallel
computer which is most suitable for low 
levels of parallelism (the SERIFACR algo- 
rithm). Here we extend the previous work 
to a way of implementation that maximises 
the parallelism (i.e. vector length) of 
the algorithm and is most suitable for 
highly parallel computers (the PARAFACR 
algorithm). The reader is referred to the above book for a
derivation of the operations' counts for Fourier analysis and cyclic
reduction. We will quote these here and concentrate on the problem
of finding the optimum value of $\ell$.


SERIFACR Algorithm 
The FACR algorithm involves five stages, and the variables that are
related in each stage are shown for the FACR(1) algorithm in Fig. 4.

Fig. 4. Data relationships in the FACR(1) algorithm, panels (a) to
(e). The arrowed lines join variables related by equations or an FFT
during different stages of the algorithm. (From Hockney and Jesshope
1981, courtesy of Adam Hilger).

The stages are:

(a) Modify RHS - block cyclic reduction by lines means the
    modification of the right-hand side of the Poisson equation on
    $n2^{-r}$ lines, where $r = 1, 2, \ldots, \ell$. Vectors are run
    in the vertical direction, and are composed of corresponding
    variables in each of the $n2^{-r}$ lines. The vector length is
    therefore $n2^{-r}$. The number of parallel operations is
    $(3\times 2^{r-1}+2)n$, thus the time for this stage of the
    algorithm is proportional to

    \[ t_a = \sum_{r=1}^{\ell} (n_{1/2} + n2^{-r})(3\times 2^{r-1} + 2)\,n \qquad (21) \]

    The factor $r_\infty^{-1}$ is omitted in the above because, as
    was seen in section III, it cancels out in any comparison of
    different algorithms on the same computer.

(b) Fourier analysis - is performed on $n2^{-\ell}$ lines in
    parallel. Vectors of length $n2^{-\ell}$ are run vertically
    across the lines. The transforms are real and of length n and
    can be performed by the fast Fourier transform (FFT) in
    $2\tfrac{1}{2} n\log_2 n$ vector operations of length
    $n2^{-\ell}$, hence

    \[ t_b = (n_{1/2} + n2^{-\ell})\,2\tfrac{1}{2}\,n\log_2 n \qquad (22) \]

(c) Solve harmonic equations - n tridiagonal equations, each of
    length $n2^{-\ell}$, are solved for the n harmonic amplitudes.
    The vectors now run horizontally and are of length n. The
    tridiagonal systems only involve variables from the last lines
    modified in stage (a) and Fourier transformed in stage (b). The
    time of execution is proportional to

    \[ t_c = 5(n_{1/2} + n)\,n2^{-\ell} \qquad (23) \]

    The coefficient five is appropriate for solution by Gauss
    elimination, taking into account that the immediate sub- and
    super-diagonals of the tridiagonal matrices are unity, and that
    the main diagonal is a constant.

(d) Fourier synthesis - on the same lines as stage (b) gives the
    solution on these lines. The FFT is used in the same way as in
    stage (b) giving

    \[ t_d = (n_{1/2} + n2^{-\ell})\,2\tfrac{1}{2}\,n\log_2 n \qquad (24) \]

(e) Filling in - having found the solution on every $2^{\ell}$-th
    line in stage (d), fill-in takes place recursively. Each level,
    r, requires the formation of a right-hand side (2 operations)
    and the successive solution of $2^{r-1}$ tridiagonal systems.
    Vectors run vertically as in stage (a), and the time is
    proportional to

    \[ t_e = \sum_{r=1}^{\ell} (n_{1/2} + n2^{-r})(5\times 2^{r-1} + 2)\,n \qquad (25) \]


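Evaluating the sums in Eqns. (21) to (25) we find the total time of
execution per mesh point to be proportional to

\[ t_{SERIFACR} \propto s' + \frac{n_{1/2}}{n}\,q' \qquad (26) \]

where

\[ s' = 4\ell + 4 + (1 + 5\log_2 n)\,2^{-\ell} \]

\[ q' = 4\ell - 8 + 8\times 2^{\ell} + 5\times 2^{-\ell} + 5\log_2 n \]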
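
As a cross-check on Eqn. (26), the five stage times can be summed
numerically and compared with the closed forms for s' and q'. The
following is only a sketch under the timing formulae (21) to (25) as
quoted above; the function names are illustrative and not taken from
the original work.

```python
import math


def serifacr_stage_times(n, ell, n_half):
    """Stage times t_a..t_e of Eqns. (21)-(25), in units of 1/r_inf."""
    log2n = math.log2(n)
    t_a = sum((n_half + n * 2.0**-r) * (3 * 2**(r - 1) + 2) * n
              for r in range(1, ell + 1))
    t_b = (n_half + n * 2.0**-ell) * 2.5 * n * log2n
    t_c = 5 * (n_half + n) * n * 2.0**-ell
    t_d = t_b  # Fourier synthesis costs the same as Fourier analysis
    t_e = sum((n_half + n * 2.0**-r) * (5 * 2**(r - 1) + 2) * n
              for r in range(1, ell + 1))
    return t_a, t_b, t_c, t_d, t_e


def per_mesh_point(n, ell, n_half):
    """Total of the five stages divided by the n*n mesh points."""
    return sum(serifacr_stage_times(n, ell, n_half)) / n**2


def serifacr_closed_form(n, ell, n_half):
    """Right-hand side of Eqn. (26) with the quoted s' and q'."""
    log2n = math.log2(n)
    s = 4 * ell + 4 + (1 + 5 * log2n) * 2.0**-ell
    q = 4 * ell - 8 + 8 * 2**ell + 5 * 2.0**-ell + 5 * log2n
    return s + (n_half / n) * q
```

For any n, $\ell$ and $n_{1/2}$ the two functions `per_mesh_point`
and `serifacr_closed_form` should agree to rounding error.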


The equal performance line between the algorithm with $\ell$ levels
of reduction and that with $\ell+1$ is easily found to be given by

\[ \frac{n_{1/2}}{n} = \frac{(1 + 5\log_2 n)\,2^{-(\ell+1)} - 4}{4 + 8\times 2^{\ell} - 5\times 2^{-(\ell+1)}} \qquad (27) \]


The form of Eqn. (27) suggests that a suitable parameter plane for
the analysis of SERIFACR is the $(n_{1/2}/n, n)$ phase plane, and
this is shown in Fig. 5. The equal performance lines given by
Eqn. (27) divide the plane into regions in which $\ell$=0,1,2,3 are
the optimum choices. Lines of constant value of $n_{1/2}$ in this
plane lie at 45 degrees to the axes, and the lines for
$n_{1/2}$=16, 128, 2048 are shown dotted in Fig. 5. These lines are
considered typical for the behaviour, respectively, of the CRAY-1,
CYBER 205, and the average performance of the ICL DAP. For practical
mesh sizes (say $n \lesssim 500$) we would expect to use $\ell$=1 or
2 on the CRAY-1, $\ell$=0 or 1 on the CYBER 205, and $\ell$=0 on the
ICL DAP. The lower of the two values for $\ell$ applies to problems
with n<100. Temperton [9] has timed a SERIFACR($\ell$) program on
the CRAY-1 and measured the optimum value of $\ell$=1 for n=32, 64
and 128. This agrees with our figure except for n=128 where Fig. 5
predicts $\ell$=2 as optimal. This discrepancy is probably because
Temperton uses the Buneman form of cyclic reduction (see Hockney
[7]) which increases the computational cost of cyclic reduction and
tends to move the optimum value of $\ell$ to smaller values.
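
The predictions quoted above can be reproduced by minimising the
time of Eqn. (26) over $\ell$. This is a minimal sketch: the helper
names and the default search range are assumptions, and the value
$n_{1/2}$=16 for the CRAY-1 is the one quoted in the text.

```python
import math


def serifacr_time(n, ell, n_half):
    """Time per mesh point from Eqn. (26), in units of 1/r_inf."""
    log2n = math.log2(n)
    s = 4 * ell + 4 + (1 + 5 * log2n) * 2.0**-ell
    q = 4 * ell - 8 + 8 * 2**ell + 5 * 2.0**-ell + 5 * log2n
    return s + (n_half / n) * q


def optimum_ell(n, n_half, max_ell=None):
    """Level of reduction that minimises the modelled time."""
    if max_ell is None:
        max_ell = int(math.log2(n)) - 1  # assumed upper limit of the search
    return min(range(max_ell + 1),
               key=lambda ell: serifacr_time(n, ell, n_half))


cray_choices = {n: optimum_ell(n, 16) for n in (32, 64, 128)}
```

With these inputs the model selects $\ell$=1 for n=32 and 64 and
$\ell$=2 for n=128, in line with the reading of Fig. 5 given above.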

For a given problem size (value of n) Fig. 5 shows more serial
computers (smaller $n_{1/2}$) to the left and more parallel
computers (larger $n_{1/2}$) to the right. We see therefore that the
more parallel the computer, the smaller is the optimum value of
$\ell$.

In the SERIFACR algorithm the vectors are laid out along one or
other side of the mesh and never exceed a vector length of n. It is
an algorithm suited to computers that perform well on such vectors,
i.e. those that have $n_{1/2} \lesssim n$, and/or which have a
natural parallelism (or vector length) which matches n. The latter
statement refers to the fact that some computers (e.g. CRAY-1) have
vector registers capable of holding vectors of a certain length (64
elements in the CRAY-1). There is then an advantage in using an
algorithm that has vectors of this length and therefore fits the
hardware design of the computer. For example, the SERIFACR algorithm
would be particularly well suited for solving a 64x64 Poisson
problem on the CRAY-1 using vectors of maximum length 64;
particularly as this machine is working at better than 80 percent of
its maximum performance for vectors of this length. On other
computers, such as the CYBER 205, there are no vector registers and
$n_{1/2} \approx 100$. For these machines it is desirable to
increase the vector length as much as possible, preferably to
thousands of elements. This means implementing the FACR algorithm in
such a way that the parallelism is proportional to $n^2$ rather than
n. That is to say the vectors are matched to the size of the whole
two-dimensional mesh, rather than to one of its sides. The PARAFACR
algorithm that we now describe is designed to do this.

Fig. 5. The $(n_{1/2}/n, n)$ parameter plane for the SERIFACR($\ell$)
algorithm. The solid lines delineate regions where the stated values
of $\ell$ lead to the minimum execution time. The dotted lines are
lines of constant $n_{1/2}$ corresponding to the CRAY-1 (=20), CYBER
205 (=100) and the average performance of the ICL DAP (=2048).


PARAFACR Algorithm 


Each of the stages of the FACR algorithm can be implemented with
vector lengths proportional to $n^2$:


(a) Modify RHS - at each level, r, of cyclic reduction the
    modification of the right-hand side can be done in parallel on
    all the $n^2 2^{-r}$ mesh points that are involved. Hence the
    timing formula becomes

    \[ t_a = \sum_{r=1}^{\ell} (n_{1/2} + n^2 2^{-r})(3\times 2^{r-1} + 2) \qquad (28) \]

(b) Fourier analysis - The $n2^{-\ell}$ transforms of length n are
    performed in parallel as in SERIFACR, but now we use a parallel
    algorithm, PARAFT, for performing the FFT with a vector length
    of n. The vector length for all lines becomes $n^2 2^{-\ell}$
    and the timing equation is

    \[ t_b = (n_{1/2} + n^2 2^{-\ell})\,4\log_2 n \qquad (29) \]

    The factor 4 replaces the $2\tfrac{1}{2}$ in Eqn. (22) because
    extra operations are introduced in order to keep the vector
    length as high as possible in the PARAFT algorithm (see Hockney
    and Jesshope [1], page 315). We also note that the factor n has
    moved inside the parentheses in comparing Eqn. (22) with (29),
    because the vector length has increased from $n2^{-\ell}$ to
    $n^2 2^{-\ell}$.

(c) Solve harmonic equations - the harmonic equations are solved in
    parallel as in SERIFACR, but we use a parallel form of scalar
    cyclic reduction, PARACR, instead of Gauss elimination for the
    solution of the tridiagonal systems (see Hockney and Jesshope
    [1], page 289). For the special case of the coefficients
    previously noted, there are 3 parallel operations at each of
    $\log_2 n$ levels of scalar cyclic reduction. The vector length
    is $n^2 2^{-\ell}$ giving

    \[ t_c = (n_{1/2} + n^2 2^{-\ell})\,3\log_2 n \qquad (30) \]

(d) Fourier synthesis - as stage (b)

    \[ t_d = (n_{1/2} + n^2 2^{-\ell})\,4\log_2 n \qquad (31) \]

(e) Filling in - at each level, r, $n2^{-r}$ tridiagonal systems of
    length n are to be solved. Using PARACR as in stage (c) the
    vector length is $n^2 2^{-r}$. Afterwards a further two
    operations are required per point which may also be done in
    parallel giving

    \[ t_e = \sum_{r=1}^{\ell} (n_{1/2} + n^2 2^{-r})(3\times 2^{r-1}\log_2 n + 2) \qquad (32) \]


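The time per mesh point for the PARAFACR algorithm is therefore
proportional to

\[ n^{-2}\,t_{PARAFACR} \propto s'' + \frac{n_{1/2}}{n^2}\,q'' \qquad (33) \]

where

\[ s'' = \tfrac{1}{2}(3\log_2 n + 1)\,\ell + 4 + (11\log_2 n - 4)\,2^{-\ell} \]

\[ q'' = 4\ell + (3\log_2 n + 1)(2^{\ell} - 1) + 11\log_2 n \]

The equal performance line between the level $\ell$ and $\ell+1$
algorithms is given by

\[ \frac{n_{1/2}}{n^2} = \frac{(11\log_2 n - 4)\,2^{-(\ell+1)} - \tfrac{1}{2}(3\log_2 n + 1)}{4 + (3\log_2 n + 1)\,2^{\ell}} \qquad (34) \]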

Fig. 6. The $(n_{1/2}/n^2, n)$ parameter plane for the
PARAFACR($\ell$) algorithm. Notation as in Fig. 5.


The form of Eqn. (34) leads us to choose to plot the results for the
PARAFACR algorithm on the $(n_{1/2}/n^2, n)$ parameter plane, and
this is done in Fig. 6. We find that the equal performance lines are
approximately vertical in this plane, and conclude that $\ell$=2 is
optimal for $n_{1/2} < 0.1n^2$, $\ell$=1 for
$0.1n^2 < n_{1/2} < n^2$ and $\ell$=0 for $n_{1/2} > n^2$. There are
no circumstances when more than two levels of reduction are worth
while, thus justifying our use of the unstabilised FACR algorithm
(see Hockney and Jesshope [1], page 348). In particular, for a
processor array with as many or more processors than mesh points
($N \ge n^2$), we take $n_{1/2} = \infty$ and find $\ell$=0. This
case corresponds to the solution of a 64x64 problem on the ICL DAP
which is an array of 64x64 processors. The dotted line for
$n_{1/2}$=100 is shown in Fig. 6, corresponding to the CYBER 205.
For all but the smallest meshes (i.e. for n>30) we find $\ell$=2
optimal. The line for $n_{1/2}$=20 is also given, from which we
conclude that $\ell$=2 is optimal in all circumstances if this
algorithm is used on the CRAY-1.
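
The region boundaries read from Fig. 6 can be checked against
Eqn. (34). The sketch below simply evaluates the expressions for
s'', q'' and the equal performance line as quoted in the text; the
function names are illustrative only.

```python
import math


def parafacr_time(n, ell, n_half):
    """Time per mesh point from Eqn. (33), in units of 1/r_inf."""
    log2n = math.log2(n)
    s = 0.5 * (3 * log2n + 1) * ell + 4 + (11 * log2n - 4) * 2.0**-ell
    q = 4 * ell + (3 * log2n + 1) * (2**ell - 1) + 11 * log2n
    return s + (n_half / n**2) * q


def boundary(n, ell):
    """Value of n_half/n^2 on the line between levels ell and ell+1."""
    log2n = math.log2(n)
    num = (11 * log2n - 4) * 2.0**-(ell + 1) - 0.5 * (3 * log2n + 1)
    den = 4 + (3 * log2n + 1) * 2**ell
    return num / den


# Boundaries for a 64x64 mesh: between levels 0/1, 1/2 and 2/3.
lines_64 = [boundary(64, ell) for ell in range(3)]
```

For n=64 the first two boundaries come out close to 1 and 0.1, and
the third is negative, which is the statement that a third level of
reduction is never worth while in this model.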


SERIFACR/PARAFACR Comparison 


So far we have considered the choice of the best value of $\ell$ for
each algorithm. Having optimised each algorithm we now consider
which is the best algorithm to use. This is done by plotting
$t_{SERIFACR}$ and $t_{PARAFACR}$ against $n_{1/2}/n$ for a series
of values of n, in order to determine approximately which algorithms
abut each other in different parts of the parameter plane. One can
then calculate the equal performance line between PARAFACR($\ell$)
and SERIFACR($\ell'$) from

\[ \frac{n_{1/2}}{n} = \frac{a - b}{c - d} \qquad (35) \]

where

\[ a = \tfrac{1}{2}(3\log_2 n + 1)\,\ell + 4 + (11\log_2 n - 4)\,2^{-\ell} \]

\[ b = 4\ell' + 4 + (1 + 5\log_2 n)\,2^{-\ell'} \]

\[ c = 4\ell' - 8 + 8\times 2^{\ell'} + 5\times 2^{-\ell'} + 5\log_2 n \]

\[ d = \frac{4\ell + (3\log_2 n + 1)(2^{\ell} - 1) + 11\log_2 n}{n} \]

The interaction of the two algorithms is shown in Fig. 7 on the
$(n_{1/2}/n, n)$ parameter plane. This division between the two
algorithms is about vertical in this plane showing that SERIFACR is
the best algorithm for smaller $n_{1/2} < 0.4n$ (the more serial
computers), and that PARAFACR is the best for larger
$n_{1/2} > 0.4n$ (the more parallel computers). Lines of constant
$n_{1/2}$ are shown for the CRAY-1 and CYBER 205. We conclude that
SERIFACR should be used on the CRAY-1 except for small meshes with
n<64 when PARAFACR(2) is likely to be better. On the CYBER 205,
PARAFACR is preferred except for very large meshes when SERIFACR(2)
(300<n<1500) or SERIFACR(1) (n>1500) is better.
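
The choice between the two implementations can be sketched by
evaluating both time formulae and taking the minimum. This is an
illustration only: it uses the expressions for s', q', s'' and q''
as quoted above, limits the search to $\ell \le 3$ by assumption,
and takes $n_{1/2}$=20 and 100 for the CRAY-1 and CYBER 205 from the
figure captions.

```python
import math


def serifacr_time(n, ell, n_half):
    """SERIFACR time per mesh point, Eqn. (26)."""
    log2n = math.log2(n)
    s = 4 * ell + 4 + (1 + 5 * log2n) * 2.0**-ell
    q = 4 * ell - 8 + 8 * 2**ell + 5 * 2.0**-ell + 5 * log2n
    return s + (n_half / n) * q


def parafacr_time(n, ell, n_half):
    """PARAFACR time per mesh point, Eqn. (33)."""
    log2n = math.log2(n)
    s = 0.5 * (3 * log2n + 1) * ell + 4 + (11 * log2n - 4) * 2.0**-ell
    q = 4 * ell + (3 * log2n + 1) * (2**ell - 1) + 11 * log2n
    return s + (n_half / n**2) * q


def best(n, n_half, max_ell=3):
    """Return (algorithm name, level) with the smallest modelled time."""
    candidates = []
    for ell in range(max_ell + 1):
        candidates.append((serifacr_time(n, ell, n_half), "SERIFACR", ell))
        candidates.append((parafacr_time(n, ell, n_half), "PARAFACR", ell))
    return min(candidates)[1:]
```

For example, `best(32, 20)` selects PARAFACR(2) and `best(128, 20)`
selects SERIFACR(2), matching the conclusion for the CRAY-1 that
PARAFACR is preferable only on small meshes.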


Conclusions 


The optimum choice of algorithm for the solution of Poisson's
equation on a parallel computer is found to depend on the ratio of
the parallelism of the computer (as measured by its half-performance
length) to the size of the finite difference mesh. Two
implementations of the FACR($\ell$) algorithm have been considered,
and in both cases we conclude that less cyclic reduction (lower
$\ell$) should be performed the more parallel is the computer. We
find that the implementation with the smallest vector length (or
algorithmic parallelism), SERIFACR, is most suitable for computers
with low hardware parallelism (i.e. the more serial with low
$n_{1/2}$), and that the implementation with the longest vector
length, PARAFACR, is most suitable for computers with high hardware
parallelism.

Fig. 7. Comparison between SERIFACR($\ell$) and PARAFACR($\ell$),
showing the regions of the $(n_{1/2}/n, n)$ parameter plane where
each has the minimum execution time.

The above conclusions are based on the simplifying assumptions given
in the introduction. The best practice is to write a program for
both algorithms with variable $\ell$, and determine empirically the
optimum algorithm and level of reduction. Our graphs can be a guide.


PARALLEL POISSON AND BIHARMONIC SOLVERS
IMPLEMENTED ON THE EGPA MULTIPROCESSOR


Marian Vajtersic
Bratislava, Czechoslovakia


Abstract 

In this paper the use of the EGPA (Erlangen General Purpose Array)
computer system, which operates in the MIMD (Multiple Instruction -
Multiple Data) mode of parallelism, for solving the model elliptic
boundary value problems of the second and fourth orders is
presented. A direct method for solving the Poisson equation and a
semi-direct biharmonic solver are structured for parallel execution
on this hierarchical multiprocessor system. Both the computational
and intercommunication requirements of the system are taken into
account in order to minimize the transfer and synchronization steps
required. Both algorithms considered have actually been run on the
EGPA parallel computer with considerable speed-ups in comparison to
sequential execution.


Introduction 


Since the advent of parallel computers, attention has been paid to
designing efficient algorithms for the numerical solution of
boundary value problems for elliptic partial differential equations.
Among them, parallel algorithms for solving Poisson and biharmonic
equations are frequently discussed in studies concerning parallel
numerical algorithms [e.g. 1,2]. In particular, parallel Poisson and
biharmonic solvers for SIMD (Single Instruction - Multiple Data)
machines are well developed [3]. However, in most of these
algorithms the number of processors required is equated to the
number $N^2$ of discrete solution values in the domain. Despite the
fact that there exists a machine with a large number of processors
[4] and the next one [5] is being developed, on parallel systems
currently available it is often not possible to meet this demand for
solving large-scale (N=128) problems arising in practice. The most
recent comparison of the actual performance of the processor arrays
ICL DAP and Burroughs BSP, as well as of three pipeline machines, on
Poisson solving has appeared in [6].

This work was pursued during the author's stay at IMMD, Erlangen -
Nurnberg University, in 1981, under the research fellowship of the
Alexander von Humboldt Foundation.

On MIMD systems currently in operation the number of processors is
not greater than 50 [7] and therefore the algorithms should be
modified for practical computation. Some iterative methods of Jacobi
type have been tested on the C.mmp and Cm* systems [8,9]. The mesh
points were divided into portions, each of them assigned to one
process. The experiments have shown the best performance for the
purely asynchronous method, where the iteration values are evaluated
without any need for synchronization. However, most of the Poisson
solvers are synchronous and therefore significantly more difficult
to implement efficiently. The performance is strongly influenced by
synchronization of the processes and by transferring the
intermediate results between the processors.

We have implemented two synchronous algorithms on the EGPA system
[10], which is one of the operating MIMD computers in the world. The
current configuration of this hierarchically organized
multiprocessor represents an elementary pyramid. The aim in
developing the system was to assure a high degree of flexibility of
operating modes. Some characteristics of the system are described in
the following section.

In the third section a description is given of the two algorithms
selected for implementation. The Poisson equation was solved by the
direct method using the decomposition property of the matrix which
results from the discretization of the problem [11]. The second
algorithm under consideration solves the biharmonic problem by the
semidirect method [12], based on splitting the fourth-order elliptic
operator into two operators of the second order. The algorithm
proceeds iteratively, and the solutions of two Poisson equations are
to be computed in one iteration.

These fast sequential algorithms have been tailored to the EGPA
considering the connections between the processors. The allocation
of the sub-tasks to the processors affects the amount of information
transferred and the synchronization steps required. This strategy is
described in the fourth section of this paper.

The execution results are summarized in the final section. The
efficiency of the performance defined there is problem dependent,
and for sufficiently large discretization parameters N the
efficiencies are high enough to support the conclusion that EGPA is
a viable alternative to a sequential computer for the numerical
solution of the elliptic equations considered.


Characteristics of the EGPA system


The multiprocessor EGPA has been developed and run at Erlangen
University. It reflects the idea formulated by Handler, Hofmann and
Schneider in [13]. Its essential features are an extensible
hierarchical structure composed of pyramidal cells and multiple
operating modes using, as far as possible, commercially available
hardware. The set of operating modes involves array processing,
associative computation, the data-flow approach, micropipelining,
multiprocessing as well as sequential computation. Figure 1 shows a
model of the 3-level EGPA pyramid [14], where one circle corresponds
to a processor and its associated memory. The arrows illustrate a
unidirectional connection between a processor and the memory of its
neighbour while the other lines represent bidirectional connections.


Figure 1.

The current configuration consists of five processors which form a
pyramid with four slave processors in the base (A processors in the
following text). The one processor at the head of the pyramid (B
processor in the following text) is either controlling a
computational process or computing a sub-task of the problem like
the processors in the base.

The processors are 32-bit control computers of the AEG 80-60 type.
Each processor is microprogrammable and can be used for associative
processing with additional hardware. The experiments with vertical
processing [14] show promising acceleration factors.

The connections between processors are realized through the
multiport memory. The processors of the array are neighbour
connected in both directions. The processor at the top can access
the memories of all the slave processors while access in the
opposite direction does not exist.

Due to the interconnections the data transfer between the diagonal
processors is a bottleneck. In numerical applications it is often
the case that intermediate results should be available to each
processor after a synchronization step. One way to overcome this
lies in transporting the data via the left or right neighbours, as
described in the fourth section of this paper. However, each
transport of evaluated results is time consuming and therefore the
cooperation of several processors executing one task is only useful
if each processor works on large independent data quantities. For
more detailed information concerning the EGPA system we refer to
[10].

Before writing the parallel program version of an algorithm on EGPA,
the program for sequential execution is written in the
ALGOL-68-like programming language SL3. The program for parallel
execution, where each processor executes an independent process, is
then rewritten from the sequential program using the special
instruction set of the EGPA-Monitor [15]. Generally, a special
programming module should be formulated for each processor. Such a
module contains a set of procedures to be executed as well as
declarations of variables. The variables are of local and global
types. The variables of the former type are local in the main
program or in one of the processes executed, while the latter ones
are used by more processors and are located in special memory
segments. If the processes executed on all A processors are the
same, as has been the case in our applications, only one program
module for all processors has to be written. However, for each
process on the A processors, one special module is to be formulated
to create control segments where the corresponding global variables
can be declared. In order to identify the processors if identical
processes are run on all processors, a variable for the actual
processor number is to be declared in the program module.

The main program for the B processor encompasses procedures for
initialisation, execution, synchronization and termination of the
whole computational process.


The Poisson and Biharmonic Solvers 


The Dirichlet problem for the Poisson equation in two dimensions,
for an unknown function u and for given functions f and g,

\[ u_{xx} + u_{yy} = f \ \text{in } R, \qquad u = g \ \text{on } \partial R, \qquad (1) \]

on a unit square R with the boundary $\partial R$ is considered.
After discretizing the domain in both directions by a step of size
$(N+1)^{-1}$ for an integer N, the second order derivatives in (1)
can be approximated by finite difference formulae. The resulting
sparse linear system of equations is

\[ Mu = w \qquad (2) \]

where the vector u contains the $N^2$ values of u to be evaluated in
the interior grid points. The matrix M is of known block-tridiagonal
structure M = (-I,T,-I) with tridiagonal blocks T = (-1,4,-1) and
with identity matrices I of order N.

The algorithm of matrix decomposition [11] is adopted to solve (2),
taking advantage of the property

\[ T = VDV^{-1} \]

where

\[ V = (v_{ij}) = \sqrt{\frac{2}{N+1}}\,\sin\frac{ij\pi}{N+1}, \qquad i,j = 1,\ldots,N \]

and $D = \mathrm{diag}(d_1, d_2, \ldots, d_N)$ with

\[ d_i = 4 - 2\cos\frac{i\pi}{N+1}, \qquad i = 1,\ldots,N \]

are, respectively, the eigenvector and eigenvalue matrices of T.
Defining the matrices

\[ T_i = (-1, d_i, -1), \qquad i = 1,\ldots,N, \]

the steps of the algorithm are as follows:

P0. Evaluate the right-hand side vector w of (2).

P1. Compute VW = Y, where W is a matrix representation of w.

P2. Solve for i = 1,...,N

    \[ T_i \bar{u}_i = y_i \qquad (3) \]

    where $y_i$ is the i-th row of Y and

    \[ T_i = (-1, d_i, -1), \qquad i = 1,\ldots,N, \qquad (4) \]

    are of order N.

P3. Compute $V\bar{U} = U$, where the i-th row of the matrix
    $\bar{U}$ is $\bar{u}_i$ transposed for i = 1,...,N.
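
Steps P1 to P3 can be sketched sequentially as follows. This is a
minimal illustration in plain Python, not the SL3 program run on
EGPA: the function names are invented for the example, the
tridiagonal systems are solved by ordinary Gauss elimination, and
the right-hand side is assumed to be given as an N x N array W whose
columns correspond to the blocks of w.

```python
import math


def sine_matrix(N):
    """V with v_ij = sqrt(2/(N+1)) sin(ij*pi/(N+1)); V is its own inverse."""
    c = math.sqrt(2.0 / (N + 1))
    return [[c * math.sin((i + 1) * (j + 1) * math.pi / (N + 1))
             for j in range(N)] for i in range(N)]


def matmul(A, B):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def solve_tridiagonal(d, rhs):
    """Solve the system (-1, d, -1) x = rhs by Gauss elimination."""
    n = len(rhs)
    b = [d] * n
    r = list(rhs)
    for k in range(1, n):          # forward elimination
        m = -1.0 / b[k - 1]
        b[k] = d + m
        r[k] = r[k] - m * r[k - 1]
    x = [0.0] * n
    x[-1] = r[-1] / b[-1]
    for k in range(n - 2, -1, -1):  # back substitution
        x[k] = (r[k] + x[k + 1]) / b[k]
    return x


def poisson_solve(W):
    """Solve Mu = w for M = (-I, T, -I), T = (-1, 4, -1)."""
    N = len(W)
    V = sine_matrix(N)
    d = [4.0 - 2.0 * math.cos((i + 1) * math.pi / (N + 1)) for i in range(N)]
    Y = matmul(V, W)                                          # step P1
    U_bar = [solve_tridiagonal(d[i], Y[i]) for i in range(N)]  # step P2
    return matmul(V, U_bar)                                   # step P3


def apply_five_point(U):
    """Apply M to U (zero values outside the grid), for checking."""
    N = len(U)

    def at(i, j):
        return U[i][j] if 0 <= i < N and 0 <= j < N else 0.0

    return [[4 * U[i][j] - at(i - 1, j) - at(i + 1, j)
             - at(i, j - 1) - at(i, j + 1)
             for j in range(N)] for i in range(N)]
```

Applying `apply_five_point` to the result of `poisson_solve(W)`
should reproduce W to rounding error, which gives a simple check of
the three steps.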

A fast algorithm for solving the biharmonic equation

\[ u_{xxxx} + 2u_{xxyy} + u_{yyyy} = f \ \text{in } R, \qquad (5) \]

the boundary conditions being

\[ u = g_1, \qquad u_n = g_2 \ \text{on } \partial R, \]

is proposed in [12]. One iteration of the algorithm proceeds by

a (att) = [ 2w = 2(tou)?0 (we nya
2 (m= 1) (6)
-w wu
+d

where the $N^2$ order matrix H is a sparse diagonal one of the form
$H = \mathrm{diag}(I+H_1, H_1, \ldots, H_1, I+H_1)$, with
$H_1 = \mathrm{diag}(1,0,\ldots,0,1)$. The vector d is computed from
constant vectors b and c which result from the discretization of the
given functions of (5). The parameter $\omega$ is necessary for
ensuring the convergence of the process.

The iterations (6) are computed until

\[ \max_{i,j = 1,2,\ldots,N} \left| \bar{u}_{ij}^{(m+1)} - \bar{u}_{ij}^{(m)} \right| \le \epsilon \qquad (7) \]

becomes valid for a prescribed accuracy $\epsilon$. The iterative
solution u of (5) is then obtained by

\[ u = F\bar{u}^{(m+1)} \qquad (8) \]

where $\bar{u}^{(m+1)}$ satisfies condition (7) and the
block-diagonal matrix F = diag(V,V,...,V) is of order $N^2$.

The algorithm proceeds in the following steps:
m <— 1 
BO, Evaluate the constant vectors b 
and ec and compute 
a= Fev'p 4 ruv! (wn te). 
set ul) = 0, ul!) = umn, 
LAB: m <- m + 1
B1.  Calculate $w^{(m)} = HF\bar{u}^{(m)}$ taking advantage of the
     sparsity of H.
B2.  Compute $\bar{w} = Fw^{(m)}$ from the sparse vector $w^{(m)}$
     and order it into a matrix W.
B3.  Solve $T_i y_i = w_i$, where $w_i$ is the i-th row of W and
     $T_i$ is as defined in (4).
B4.  Solve $T_i \bar{u}_i = y_i$ for $\bar{u}_i$, i = 1,2,...,N,
     which are components of the vector
     $M^{-1}(M^{-1}H)u^{(m)}$ from (6).
B5.  Evaluate $\bar{u}^{(m+1)}$ by (6) and execute the test (7).
B6.  Repeat from LAB if (7) is not valid.
B7.  Evaluate u by (8).
Implementation of the algorithms 


To execute both algorithms on EGPA we have tried to split up the
computational task into four portions among the four A processors
only. The role of the B processor is to supervise the process and to
realize the output of evaluated results. For this reason it will
further be supposed that N is a multiple of 4. For explanation
purposes let us distinguish the processors on the A level by AJ,
J = 1,2,3,4, as shown in Figure 2.

Figure 2. The processors A1 to A4 of the A level with their (left,
right) neighbours: A1 (A4,A2), A2 (A1,A3), A3 (A2,A4), A4 (A3,A1).

The left and right neighbours of each processor AJ, denoted by AJL
and AJR respectively, are given in parentheses. The first term
corresponds to AJL while the second one to AJR for J = 1,...,4. To
identify values of a global variable x, we shall respectively write
x(J), x(JL) and x(JR) according to the definition of processors AJ,
AJL and AJR. Further, for a more convenient formulation of the
algorithms, some processing functions are introduced.

For the synchronization, use is made of Wait (C), where the
condition expressed by C must become valid before proceeding to the
next step of the algorithm. In practice, the processor keeps asking
during its active waiting state whether C is fulfilled or not. For
the transport phase realized by processor AJ, where a quarter of
array X located on the processor AJR is transported into the memory
of AJL, Trans ($X_{JR}$) is used. If the value of a variable x is to
be transported in the same way, the statement Trans (x) is used
analogously.

In the program formulated for processor AJ, Execute (S) will denote
execution of the whole task of step S. In contrast, by
Execute$_J$ (S) only the J-th quarter (we recall J = 1,...,4) of the
task in step S will be executed by the processor AJ.

Turning back to the algorithm for the Poisson equation, one can
observe that steps P1 and P2 are independently solvable on all four
processors. Indeed, for N being a multiple of 4, the matrix product
of step P1 can be performed in four subtasks by partitioning the
matrix V horizontally into four rectangular matrices $V_J$,
J = 1,...,4, of the same type. However, before calculating the
matrix products $V_J W = Y_J$ on the four processors concurrently,
the constant matrix W should be evaluated by each processor
completely. Without any communication between processors, in step P2
only the rows of $Y_J$ are needed on each processor AJ for the
evaluation of $\bar{U}_J$, which is row-wise structured from the N/4
solutions $\bar{u}_i$ of (3), where i = (J-1)N/4 + k, k = 1,...,N/4.

In order to compute the matrix multiplication in step P3 in
parallel, the complete matrix $\bar{U}$ must be available to each
processor. Hence, after step P2, synchronization and transport of
the quarters $\bar{U}_J$ of $\bar{U}$ between diagonally positioned
processors should follow. The synchronization of the process is
realized by setting the value of the variable berfertig on processor
AJ to true after this processor has finished its job on steps P1 and
P2. Each processor AJ waits actively until its right neighbour is
ready in order to be able to transport the array $\bar{U}_{JR}$ from
AJR into the memory of AJL. Once this transport is realized, the
value of the variable transfertig on AJL is set to true. However, it
may occur that the left neighbour AJL has not yet finished the
execution of steps P1 and P2, or even that the processor AJ has not
obtained the portion of $\bar{U}$ which should be transported to it
by AJR from its diagonally located processor.

In both cases it is necessary to wait in order to realize the
transport phase of the algorithm completely. Then, under the same
principle as in step P1, step P3 can be calculated in parallel.
Processor B realizes output of the resulting matrix U in the
finishing stage of the EGPA Poisson Solver (EPS) algorithm, which
can be formulated for parallel execution on four processors AJ,
J = 1,...,4, from the viewpoint of one processor AJ as follows:

      berfertig <- false
      transfertig <- false
      Execute (P0);
      Execute$_J$ (P1, P2);
      berfertig (J) <- true
      Wait (berfertig (JR) = true);
      Trans ($\bar{U}_{JR}$);
      transfertig (JL) <- true
      Wait (berfertig (JL) = true);
      Wait (transfertig (J) = true);
      Execute$_J$ (P3).
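
The synchronization pattern of the EPS algorithm can be imitated on
a conventional machine with threads. The sketch below is only an
analogy and makes several assumptions: Python threads stand in for
the four A processors, `threading.Event` objects stand in for the
flags berfertig and transfertig, a dictionary per processor stands
in for its memory, and the numerical work of steps P1 to P3 is
omitted. The neighbour tables follow Figure 2.

```python
import threading

LEFT = {1: 4, 2: 1, 3: 2, 4: 3}    # AJL for each AJ
RIGHT = {1: 2, 2: 3, 3: 4, 4: 1}   # AJR for each AJ


def run_eps_exchange(quarters):
    """quarters[J] is the block produced by AJ in steps P1 and P2."""
    berfertig = {J: threading.Event() for J in LEFT}
    transfertig = {J: threading.Event() for J in LEFT}
    memory = {J: {} for J in LEFT}  # memory[J][K]: quarter K held by AJ

    def processor(J):
        JL, JR = LEFT[J], RIGHT[J]
        memory[J][J] = quarters[J]        # result of Execute_J(P1, P2)
        berfertig[J].set()
        berfertig[JR].wait()              # Wait(berfertig(JR) = true)
        memory[JL][JR] = memory[JR][JR]   # Trans: from AJR into AJL
        transfertig[JL].set()
        berfertig[JL].wait()              # Wait(berfertig(JL) = true)
        transfertig[J].wait()             # Wait(transfertig(J) = true)
        # Quarters of the direct neighbours are read from their memories.
        memory[J][JL] = memory[JL][JL]
        memory[J][JR] = memory[JR][JR]
        # Execute_J(P3) would start here with all four quarters present.

    threads = [threading.Thread(target=processor, args=(J,)) for J in LEFT]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return memory
```

After the call every processor holds all four quarters: its own, the
two read from its neighbours, and the diagonal one delivered by its
right neighbour, which is the state required before step P3.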

The algorithm for the biharmonic equation is of an iterative
structure and is therefore more demanding on synchronization.
In the preprocessing phase the right-hand side vectors b and c
are to be evaluated on each processor. According to the
previous explanation of the EPS algorithm, the evaluation of
the vector d can follow in parallel, whereby each processor
evaluates the corresponding quarter of the vector.

Attention is focused on the iterative section of the algorithm,
which is crucial from the viewpoint of implementation
efficiency. We introduce a global variable m in which the
number of the currently evaluated iteration is stored. The
first synchronization phase is at the beginning of each
iteration, so that the new iteration starts only after each
processor has finished the evaluation of the preceding
iteration values. It is realized by AJ waiting on AJR, i.e.
until m(J) = m(JR), and by the transport of values of m, after
which the processor AJ waits until m(J) = mdiag(J) and
m(J) = m(JL). Here, mdiag(J) is the value of the variable m
transported from the diagonally located processor to the
processor AJ. The computation starts with the evaluation of the
complete "window" $HFu^{(m)}$ on each processor in order that
parallel multiplication by $V_N$ be executable. The twofold
elimination phase is to be performed analogously as in the
previous algorithm. Having stored its portion of the preceding
iteration values, each processor can also calculate
independently the corresponding new iteration values
$u^{(m+1)}$ by (6).

Since each processor AJ can evaluate the maximum value max of
the differences of two successive iteration values only in its
corresponding quarter of the grid points, there is a problem
how to decide on the termination of the process (6) by (7).
For this reason a global boolean variable iter has been defined
and initialized on each processor to false at the beginning of
the iteration. Its value is set to true on those processors
where max is greater than a given ε. After the evaluation of
max on processor AJ, the value berfertig (J) is increased by 1,
followed by synchronization and subsequently by the transport
of the new iteration values as well as of the iter values
between diagonal processors. The value of the synchronization
variable transfertig is increased by 1 in the left neighbouring
processor AJL.

The values of the iter variable from each processor being
available to all of them, the decision is made from these
values whether the iteration will continue or the final
computation with output of results should follow. The final
matrix multiplication is performed on the A processors while
the output of results is made through the B processor.

The programming scheme for parallel execution of one iteration
of the EGPA Biharmonic Solver (EBS) algorithm for each
processor AJ is the following:

    LAB: m ← m + 1
         Wait (m(J) = m(JR));
         Trans (m(JR));
         Wait (m(J) = mdiag (J) and m(J) = m(JL));
         iter ← false
         Execute (B1);
         Execute_J (B2, B3, B4, B5);
         if max > ε then iter (J) ← true
         berfertig (J) ← berfertig (J) + 1
         Wait (berfertig (J) = berfertig (JR));
         Trans (u^(m));
         Trans (iter (JR));
         transfertig (JL) ← transfertig (JL) + 1
         Wait (transfertig (JL) = transfertig (J) and
               berfertig (J) = berfertig (JL));
         if iter (J) or iterdiag (J) or iter (JL) or iter (JR)
            then repeat from LAB.
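
The termination rule, under which every processor keeps iterating as long as any quarter still changes by more than ε, can be illustrated as follows. This is a hedged sketch rather than the EBS program: a `threading.Barrier` stands in for the counter-based waits on berfertig and transfertig, a simple halving of the values stands in for steps B1 to B5, and the names `iterate_until_converged` and `iter_flags` are hypothetical.

```python
import threading


def iterate_until_converged(quarters, eps, max_iterations=1000):
    """Run one worker per quarter; all stop in the same iteration."""
    n = len(quarters)
    barrier = threading.Barrier(n)          # stand-in for the counter-based waits
    values = [list(q) for q in quarters]
    iter_flags = [False] * n                # one iter flag per processor
    counts = [0] * n                        # iterations performed by each worker

    def worker(j):
        while counts[j] < max_iterations:
            counts[j] += 1
            old = values[j]
            new = [0.5 * v for v in old]    # placeholder for steps B1..B5
            local_max = max(abs(a - b) for a, b in zip(new, old))
            values[j] = new
            iter_flags[j] = local_max > eps  # if max > eps then iter(J) <- true
            barrier.wait()                   # every flag has been written
            keep_going = any(iter_flags)     # iter(J) or iterdiag(J) or iter(JL) or iter(JR)
            barrier.wait()                   # every flag has been read before it is reused
            if not keep_going:
                break

    threads = [threading.Thread(target=worker, args=(j,)) for j in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts, values
```

With the quarters `[[1.0], [8.0], [0.5], [2.0]]` and `eps = 0.01`, the quarter starting at 0.5 falls below the tolerance first, but it keeps iterating until the slowest quarter has also converged, so all four workers perform the same number of iterations.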


Results and concluding remarks 


The two algorithms of the previous section were executed on
EGPA. The sequential versions described in the third section
were also run on a single processor AEG 80-60. The speed-up of
the parallel implementation of an algorithm over the sequential
one is defined by

$$ s = \frac{t_s}{t_p}, $$

where $t_s$ is the CPU time of the serial execution and $t_p$
that of the parallel one. The efficiency of the implementation
is evaluated in terms of s by

$$ e = \frac{s}{n} \cdot 100\,\%, $$

where n is the number of processors participating in the
computational work. The time values given in Tables 1 and 2
are CPU times measured for the whole computational process, not
including the times for the output of results. Since both the
algorithms EPS and EBS have been tailored to employ four A
processors on arithmetical and transport operations, n = 4 is
considered in the efficiency results. In spite of this, the
results for n = 5, due to the 5-processor configuration of
EGPA, are also given in parentheses.
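
As a small worked illustration of the two definitions, the sketch below evaluates s and e for made-up timings. The numbers are hypothetical and are not taken from Tables 1 and 2; the function names are likewise illustrative.

```python
def speed_up(t_serial, t_parallel):
    """s = t_s / t_p."""
    return t_serial / t_parallel


def efficiency_percent(s, n):
    """e = (s / n) * 100 %, with n the number of participating processors."""
    return s / n * 100.0


# Hypothetical CPU times in seconds, chosen only to show the arithmetic.
s = speed_up(12.0, 4.0)
print(s, efficiency_percent(s, 4), efficiency_percent(s, 5))
```

The same speed-up gives a lower efficiency when it is charged against five processors instead of four, which is why the values for n = 5 appear separately in parentheses.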

The results for solving the Poisson equation for the function
x*+y* are summarized for various N in Table 1. The speed-up can
be seen to increase with the problem size. This illustrates the
influence of the synchronization and transport phase on the
efficiency, which increases noticeably when the independently
solvable arithmetical tasks become dominant for large N. Since
the steps of the algorithm are highly parallel, the question
arises why the speed-up ratio does not tend more closely to the
ideal value of 4. The main reason is that all the A processors
evaluate the complete vector w simultaneously, i.e. the
speed-up for this stage of the algorithm is only 1. (Of course,
it would be possible to divide the task of evaluating w among
four processors, but at additional cost for synchronization and
transport of portions of w.)


Table 1


The biharmonic equation (5) was solved for the unknown function
x3o3y7*+2xy for several values of ε on a variety of grids. The
iteration parameters ω used are estimated as proposed in [16].
It can be seen from Table 2, where the number of iterations m
corresponds to the accuracy ε = 0.001, that the results for the
algorithm EBS are less dependent on the parameter N than in the
preceding algorithm. On the other hand, the efficiency results
achieved are slightly worse than in the algorithm EPS even for
large N. This is due to the iterative nature of the algorithm,
where two synchronization and transport phases within one
iteration affect the speed-up considerably.


Table 2


It should be investigated whether a five-processor realization
would bring an improvement of s and e for both algorithms. It
would also be interesting to know whether synchronization of
the A processors performed through the B processor would yield
better results or not.

We note that there is also a chance to use an FFT routine
instead of a classical matrix product procedure in the
algorithm EPS. However, in such an approach the performance
strategy has to be modified because the transform by V, instead
of multiplication by $V_N$, must be realized on each processor.
Hence, the synchronization and transport would have to be
performed twice, compared to once in our strategy.

The results of experiments indicate that both algorithms can be
efficiently implemented on the EGPA system. Some other
experiments and analyses also support the view that EGPA, as a
generally oriented multiprocessor, appears to be adequate for
solving a broad variety of non-numerical as well as numerical
problems.
ITERATIVE ALGORITHMS FOR TRIDIAGONAL MATRICES ON
A WSI-MULTIPROCESSOR
Abstract -- With the rapid advances in semi-
conductor technology, the construction of Wafer 
Scale Integration (WSI)-multiprocessors consisting 
of a large number of processors is now feasible. 
We illustrate the implementation of some basic 
linear algebra algorithms on such multiprocessors. 


1. Introduction 


The need has always existed for fast and effi- 
cient numerical computations that accurately and 
truly simulate physical processes. The VLSI tech- 
nology increased that demand by making numerical 
computation affordable. This, in turn, further 
expanded their popularity and application. The 
computers used for this purpose usually come in 
three forms: 


(a) High-speed parallel and/or pipelined machines. These are
general-purpose, expensive machines (CRAY-1, CYBER 203/205,
NASA's Numerical Aerodynamics Simulator, and S-1) optimized for
high performance (usually using special packaging technology
and cooling systems, and on the leading edge of technology).


(b) Attached array processors. These are less general,
order-of-magnitude less costly machines with high-speed
arithmetic, designed to perform well on certain applications
(FPS AP series). They are attached to a host processor, which
supplies only numerically intensive special tasks to the
attached processor. The attached processor is fully
programmable, but knowledge of its architecture is necessary if
the machine is to be properly exploited.


(c) Special-purpose co-processors. These are special-purpose,
low-cost "black boxes" that can execute a few well-defined
problems efficiently. They are minimally programmable and are
used as arithmetic accelerators (Intel's 8087) or as special
library subroutines (FFT, triangular solvers, filters).


In this paper we are interested in systems of 
the third type. Here, one or more special-purpose 
accelerators attached by shared bus to the low- 
cost, low-speed, general-purpose host can trans- 
form the host into a powerful number cruncher for 
a specific application. These special applica- 
tions may be real time applications, or just some 
frequently used algorithms that we would like to 
speed up. The addition of one or more special- 
purpose accelerators, at $100-$200 each, may im- 
prove performance, resulting in performance-cost 
ratios that are not available with any other 
general-purpose system. 


The WSI Model 


VLSI technology has brought us the capability of increasing
processor speed, size and complexity by several orders of
magnitude. However, it needs a hierarchical and regular design,
since the old von Neumann model is difficult to upgrade for the
emerging WSI (Wafer Scale Integration) technology. This has
influenced the search for new models of computation and
machines. In addition to the constraints imposed by packaging
and manufacturing technology, a WSI model must satisfy the
following requirements:


(a) a few types of simple processor memory modules replicated
throughout the wafer;

(b) regular communication network between modules with a
constant number of crossovers;

(c) I/O ports on the boundary of the wafer;

(d) asynchronous communication among modules;

(e) I/O rate independent of the size of the problem; and

(f) high level of fault-tolerance.

The most popular VLSI model for numeric computations, the
systolic array model introduced by H. T. Kung [KuLe79],
satisfies our conditions (a), (b), and (c). It implicitly
assumes a global synchronizing clock, even though it can be
adapted for asynchronous communication without loss of
generality. The model, however, does not satisfy conditions (e)
and (f); it assumes as many I/O ports as needed, that is, an
I/O rate that grows with the size of the problem. Furthermore,
it assumes an adequate memory (outside of the systolic array)
that stores all the data, and an environment that fetches the
data (possibly in parallel) from the memory and supplies them
at the proper moment to the systolic array at no cost.


Moreover, the systolic array model assumes that all processors
in the array are good, and if one processor becomes faulty the
whole systolic array becomes faulty. This implies that the
systolic array does not satisfy the definition of our WSI
model, since each processor must be a separate die that has
been tested fault-free before it is assembled into a systolic
array. A realistic model must allow for any number of faulty
processors in the array. An occurrence of a fault should cause
only degraded performance but not cause the system to break
down.


A multiprocessor array model satisfying criteria (a) through
(f) is shown in Fig. 1. The model is an array of identical
Switch-Processor-Memory (SPM) modules. Each processor P
communicates with its local memory M and with other processors
(and memories) through the switch S. Each S is a 5×5 crossbar
and communicates asynchronously with four neighboring switches.
The entire switch array operates in circuit switching mode.
Since there are only four possible paths from any 
input port (straight ahead, left, right, and to- 
ward processor), only two bits of information are 
needed to set up S. The communication bus between 
two switches, that is, the crossbar width, can be 
one or more bits wide and, generally, it should 
match the processor (arithmetic) bandwidth. In 
addition, each memory submodule can independently 
communicate with memory submodules of its neigh- 
bors. This link allows any number of SPM modules 
to have their memory submodules organized as one 
uniform queue. 


We will assume that if any of the S, P, or M submodules is
faulty, then the entire SPM module is faulty. Furthermore, any
number of SPM modules can be faulty at any time. A module may
be manufactured faulty or it may fail during normal operation.
The problem is how to configure a partly faulty array of
modules into a fault-free array. If this is done during the
wafer testing it increases the yield, but a failure during
operation brings the entire system down. If fault-tolerant
operation is required, the configuration algorithm must be
distributed throughout the array so that upon the detection of
a fault the multiprocessor will reconfigure itself, excluding
the faulty module from the set of available SPM modules. For
that reason, each P stores a processor status-map in which a 0
or 1 for each SPM indicates its status, good or faulty,
respectively. After each fault detection, the processor
status-map is updated.
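
The role of the status-map can be shown with a very small sketch. It is an illustration only, under strong simplifying assumptions: the SPM modules are numbered along a one-dimensional chain, and the mapping of that chain onto the two-dimensional wafer is ignored. The function names are hypothetical.

```python
def mark_faulty(status_map, module_index):
    """Return an updated status-map in which the given SPM is marked faulty (1)."""
    updated = list(status_map)
    updated[module_index] = 1
    return updated


def available_modules(status_map):
    """Indices of the good SPM modules (status 0), in chain order."""
    return [i for i, status in enumerate(status_map) if status == 0]


status = [0] * 6                 # six modules, all good initially
status = mark_faulty(status, 2)  # a fault is detected in module 2
print(available_modules(status))
```

After the update the logical chain simply skips the faulty module, which is the sense in which a fault degrades performance without bringing the system down.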
Generally, there are two problems in WSI design. Firstly, an
algorithm must be developed that can be easily mapped into the
logical model of the WSI multiprocessor, that is, the
fault-free array of SPM modules.


Secondly, the logical model must be mapped into the physical
model, which is a partly faulty array of SPM modules on the
wafer. It helps, as is the case in our paper, if the logical
model is one-dimensional and the physical model is
two-dimensional. The algorithms for mapping into the physical
model are beyond the scope of this paper. Some interesting work
in this area is described in [AuCa78], [Kore81], and [FuVa82].


In this paper we describe the configuration 
shown in Fig. 2 and demonstrate its suitability 
for three important linear algebra problems. 


2. Algorithms


Here we consider three simple, yet important, problems. The
first deals with the determination of the distribution of the
eigenvalues of a large positive definite tridiagonal matrix.
This problem arises frequently in the area of mathematical
physics. The second problem, which is related to the first, is
that of obtaining the distribution of the real roots of a
polynomial of degree n,
$P_n(x) = \sum_{i=0}^{n} \gamma_i x^{n-i}$, in which all the
coefficients $\gamma_i$ are real and nonzero. Finally, the
third problem is concerned with the solution of large positive
definite tridiagonal linear systems, a problem that arises in
numerous applications.


For the first two problems we employ Rutishauser's quotient
difference algorithm (the QD-algorithm), e.g., see [Ruti63],
[Henr58 and 63], and [ScRS73]. For the third problem, we also
use an iterative algorithm. Specifically, we use the cyclic
Chebyshev semi-iterative method, see [Varg62], [Wach66], and
[Youn71].


2.1 The QD-Algorithm


(a) Problem 1

Let

$$ T = [\beta_{i-1}, \alpha_i, \beta_i] \tag{1} $$

be the positive definite tridiagonal matrix under
consideration, where $\beta_i \neq 0$, $1 \le i \le n-1$. Since
the eigenvalues $\lambda_1 > \lambda_2 > \cdots > \lambda_n$ of
T are invariant under similarity transformations, we consider
instead the tridiagonal matrix $\hat{T} = D^{-1} T D$, where D
is a diagonal matrix chosen such that

$$ J_1 = \hat{T} = [\beta_{i-1}^2, \alpha_i, 1]. \tag{2} $$

Applying the classical LR-algorithm to $J_1$, see [Ruti58 and
63] and [Wilk65], as given by the iterations

$$ J_k = L_k R_k, \qquad J_{k+1} = R_k L_k, \qquad k = 1, 2, \ldots \tag{3} $$

in which

$$ L_k = \begin{pmatrix} 1 & & & \\ e_1^{(k)} & 1 & & \\ & \ddots & \ddots & \\ & & e_{n-1}^{(k)} & 1 \end{pmatrix}
\quad \text{and} \quad
R_k = \begin{pmatrix} q_1^{(k)} & 1 & & \\ & q_2^{(k)} & \ddots & \\ & & \ddots & 1 \\ & & & q_n^{(k)} \end{pmatrix} $$

are unit lower triangular and upper triangular matrices,
respectively, it is known that as $k \to \infty$, $L_k \to I$
and $q_j^{(k)} \to \lambda_j$. Now, from the fact that
$L_{k+1} R_{k+1} = R_k L_k$, we can derive the QD-scheme for
computing approximations to $\lambda_j$, $1 \le j \le n$.


The scheme is given by

$$ q_1^{(1)} = \alpha_1, \qquad
e_j^{(1)} = \beta_j^2 / q_j^{(1)}, \qquad
q_{j+1}^{(1)} = \alpha_{j+1} - e_j^{(1)}, \qquad 1 \le j \le n-1, $$

and for k = 2, 3, ...

$$ q_j^{(k)} = q_j^{(k-1)} + e_j^{(k-1)} - e_{j-1}^{(k)}, \qquad 1 \le j \le n, $$
$$ e_j^{(k)} = e_j^{(k-1)} \, q_{j+1}^{(k-1)} / q_j^{(k)}, \qquad 1 \le j \le n-1, \tag{4} $$

where $e_0^{(k)} = e_n^{(k)} = 0$. This QD-scheme is
represented by two Rhombus rules:

$$ q_j^{(k-1)} + e_j^{(k-1)} = \underline{q_j^{(k)}} + e_{j-1}^{(k)} \tag{5a} $$

$$ q_{j+1}^{(k-1)} \, e_j^{(k-1)} = q_j^{(k)} \, \underline{e_j^{(k)}} \tag{5b} $$

(quantities to be computed are underlined). The two sequences
$\{q_j^{(k)}\}$ and $\{e_j^{(k)}\}$ are guaranteed to be
positive, with $\lim_{k\to\infty} q_j^{(k)} = \lambda_j$ and
$\lim_{k\to\infty} e_j^{(k)} = 0$. The rate of convergence,
which is only linear, can be accelerated by various techniques,
see [ScRS73]. Since we are only interested in the distribution
of the eigenvalues, rather than accurate approximation of them,
the unaccelerated QD-algorithm is adequate for our purposes.
For any iteration k, the eigenvalue $\lambda_j$ lies in an
interval of center


(k) _ (ke) , Ck) 
Wp FT 4p Fe; 


bd 


and radius 


AGING) : 


Tj4a 


p= 


We terminate the algorithm when, for any k, 
k k 
cel 66), 


2< i<n, is less than a given 


tolerance. 
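
The QD-scheme (4) is short enough to try directly. The sketch below is a sequential illustration in plain Python, not the dataflow mapping of Fig. 3; it runs a fixed number of iterations instead of applying the termination test, and the function name and argument layout are assumptions made for the example.

```python
def qd_tridiagonal(alpha, beta, iterations):
    """Unaccelerated QD iteration for T = [beta_{i-1}, alpha_i, beta_i].

    alpha: the n diagonal entries; beta: the n-1 off-diagonal entries.
    Returns (q, e); q approximates the eigenvalues in decreasing order.
    """
    n = len(alpha)
    q = [0.0] * n
    e = [0.0] * (n + 1)            # e[0] = e[n] = 0 are the boundary values
    q[0] = alpha[0]
    for j in range(n - 1):         # first row of scheme (4)
        e[j + 1] = beta[j] ** 2 / q[j]
        q[j + 1] = alpha[j + 1] - e[j + 1]
    for _ in range(iterations):
        q_new = [0.0] * n
        e_new = [0.0] * (n + 1)
        for j in range(n):
            # rule (5a): q_j(k) = q_j(k-1) + e_j(k-1) - e_{j-1}(k)
            q_new[j] = q[j] + e[j + 1] - e_new[j]
            if j < n - 1:
                # rule (5b): e_j(k) = e_j(k-1) * q_{j+1}(k-1) / q_j(k)
                e_new[j + 1] = e[j + 1] * q[j + 1] / q_new[j]
        q, e = q_new, e_new
    return q, e[1:n]


q, e = qd_tridiagonal([2.0, 2.0, 2.0], [1.0, 1.0], 80)
print(q, e)
```

For the 3×3 matrix with diagonal 2 and off-diagonal 1, the q values should approach 2 + √2, 2 and 2 − √2 while the e values decay towards zero, in line with the linear convergence described above.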


Fig. 3 shows the dataflow diagram of our algorithm. The obvious
mapping of the dataflow diagram onto the chain of Ps is to
assign each row to one P. In this case, one single value has to
be sent from one P to the neighboring P during the time of two
arithmetic operations. Therefore, for a balanced design, the
communication bandwidth of a processor P must be
$b/\min(t_{/} + t_{*},\; t_{+} + t_{-})$ bps, where b is the
number of bits in each value sent and $t_{/}$, $t_{*}$,
$t_{+}$, and $t_{-}$ are the execution times for division,
multiplication, addition, and subtraction, respectively.
Furthermore, it takes $2k(t_{/} + t_{*} + t_{+} + t_{-})$ to
generate $q_1^{(k)}$ and
$(n+2k)(t_{/} + t_{*} + t_{+} + t_{-})$ to obtain
$q_1^{(k)}, \ldots, q_n^{(k)}$.

(b) Problem 2 


Consider the n-th degree polynomial with unit leading
coefficient

$$ P_n(x) = x^n + \gamma_1 x^{n-1} + \cdots + \gamma_{n-1} x + \gamma_n, $$

where all the $\gamma_i$'s are real and different from zero.
The roots of $P_n(x)$ are, therefore, either real or appear in
complex conjugate pairs. The QD-algorithm used for Problem 1
can be adapted to obtain all the real roots, as well as the
real coefficients of the quadratic factors whose roots
constitute the complex conjugate roots of $P_n(x)$, e.g., see
[Henr58, 64, and 67]. The QD-table for this problem is
generated row by row using the two Rhombus rules (5), where the
top two rows of the table are given by:

$$ q_1^{(0)} = -\gamma_1, \qquad q_j^{(-j+1)} = 0, \quad 2 \le j \le n, \tag{6a} $$

and

$$ e_j^{(-j+1)} = \gamma_{j+1}/\gamma_j, \quad 1 \le j \le n-1, \tag{6b} $$

with $e_0^{(k)} = e_n^{(k)} = 0$.
0 n 


In Fig. 4, we show the flow of the computation for n = 3. If
the horizontal strip (indicated by dashed lines in Fig. 4) is
mapped on one PE, then the communication rate between two PEs
is $b/\min(t_{/} + t_{*},\; t_{+} + t_{-})$ bps. If only half
the strip is mapped on one PE, the rate is doubled. The
execution time is the same as in the previous example. We now
state the following theorem regarding the convergence of the
algorithm.


Theorem [Henr67]
Let the roots of $P_n(x)$ be $z_1, z_2, \ldots, z_n$; then

(i) for every j such that $|z_j| > |z_{j+1}|$,
$$ \lim_{k\to\infty} e_j^{(k)} = 0; $$

(ii) for every j such that $|z_{j-1}| > |z_j| > |z_{j+1}|$,
$$ \lim_{k\to\infty} q_j^{(k)} = z_j; $$

(iii) for every j such that
$|z_{j-1}| > |z_j| \ge |z_{j+1}| > |z_{j+2}|$,
$$ \lim_{k\to\infty} q_j^{(k)} q_{j+1}^{(k)} = z_j z_{j+1}, \quad \text{and} \quad
\lim_{k\to\infty} \left( q_j^{(k+1)} + q_{j+1}^{(k)} \right) = z_j + z_{j+1}. $$


Hence, we terminate the algorithm when, for any k,

$$ \left| \left( q_j^{(k)} + q_{j+1}^{(k-1)} \right) - \left( q_j^{(k-1)} + q_{j+1}^{(k-2)} \right) \right| \le \epsilon, $$

where ε is a given tolerance.
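
For real, well-separated roots the row-by-row construction can be sketched as below. The code uses the common progressive form of the QD table, in which each stored row holds the staggered entries of one diagonal; the starting rows q_1 = -γ_1, q_j = 0 and e_j = γ_{j+1}/γ_j are assumed here, and the function name is hypothetical. Complex conjugate pairs, case (iii) of the theorem, are not handled by this sketch.

```python
def qd_polynomial(coeffs, iterations):
    """Progressive QD table for x^n + g1*x^(n-1) + ... + gn.

    coeffs = [1, g1, ..., gn], all nonzero. Returns (q, e); for real roots of
    distinct modulus, q approaches the roots in order of decreasing modulus.
    """
    n = len(coeffs) - 1
    q = [0.0] * n
    q[0] = -coeffs[1] / coeffs[0]
    e = [0.0] * (n + 1)                     # e[0] = e[n] = 0
    for j in range(1, n):
        e[j] = coeffs[j + 1] / coeffs[j]
    for _ in range(iterations):
        # sum rule: new q_j = q_j + e_j - e_{j-1}
        q = [q[j] + e[j + 1] - e[j] for j in range(n)]
        # product rule: new e_j = e_j * (new q_{j+1}) / (new q_j)
        for j in range(1, n):
            e[j] = e[j] * q[j] / q[j - 1]
    return q, e[1:n]


# (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6
q, e = qd_polynomial([1.0, -6.0, 11.0, -6.0], 100)
print(q, e)
```

For this cubic the q values should approach 3, 2 and 1, and the e values should tend to zero as in part (i) of the theorem.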


2.2 An Iterative Tridiagonal Solver 


Here, we assume that one requires solving linear systems of the
form Tx = f, where T is the positive definite matrix given in
(1) with all the diagonal elements $\alpha_i = 1$. Without loss
of generality, we assume that n is even. Let P be a permutation
matrix such that the linear system
$(P^T T P)(P^T x) = (P^T f)$ is of the form

$$ \begin{pmatrix} I_{n/2} & E \\ E^T & I_{n/2} \end{pmatrix}
\begin{pmatrix} x^{(R)} \\ x^{(B)} \end{pmatrix} =
\begin{pmatrix} f_R \\ f_B \end{pmatrix} \tag{7} $$

where E is the lower bidiagonal matrix with diagonal elements
$\beta_1, \beta_3, \ldots, \beta_{n-1}$ and subdiagonal
elements $\beta_2, \beta_4, \ldots, \beta_{n-2}$; $x^{(R)}$ and
$f_R$ contain the odd-indexed elements of x and f,
respectively, and $x^{(B)}$ and $f_B$ contain the even-indexed
elements. The cyclic Chebyshev semi-iterative scheme is given
as follows, see [Varg62].


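$x_0^{(B)}$ is chosen arbitrarily,

$$ x_k^{(R)} = x_{k-1}^{(R)} + \omega_{2k-1} \left( f_R - E\, x_{k-1}^{(B)} - x_{k-1}^{(R)} \right), $$
$$ x_k^{(B)} = x_{k-1}^{(B)} + \omega_{2k} \left( f_B - E^T x_k^{(R)} - x_{k-1}^{(B)} \right), \qquad k = 1, 2, 3, \ldots \tag{8} $$

Here, the $\omega_j$'s are the optimal acceleration parameters
and are given by

$$ \omega_1 = 1, \qquad \omega_2 = 2/(2 - \rho^2), \qquad
\omega_{j+1} = 1/(1 - \rho^2 \omega_j / 4), \quad j \ge 2, $$

in which $\rho < 1$ is the spectral radius of the matrix

$$ \begin{pmatrix} 0 & E \\ E^T & 0 \end{pmatrix}. $$

Assuming that ρ is given, the iteration (8) may be written as

1. $\xi_1 := 1$; $\eta_1 := 2/(2 - \rho^2)$
2. $y_0 :=$ a random vector of order n/2
3. $x_1 := f_R - E y_0$; $y_1 := (1 - \eta_1) y_0 + \eta_1 (f_B - E^T x_1)$
4. $i := 2$
5. $\xi_i := (1 - \rho^2 \eta_{i-1}/4)^{-1}$
6. $x_i := (1 - \xi_i) x_{i-1} + \xi_i (f_R - E y_{i-1})$
7. $\eta_i := (1 - \rho^2 \xi_i/4)^{-1}$
8. $y_i := (1 - \eta_i) y_{i-1} + \eta_i (f_B - E^T x_i)$
9. $i := i + 1$
10. If stopping criterion is not satisfied, go back to 5.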


The stopping criterion is based on the convergence of $x^{(R)}$
($x_i$ in the above program segment) and is given by

$$ \| f_R - E y_i - x_i \|_\infty < \epsilon, $$

where ε is a specified tolerance. If ρ is not given a priori,
we have several options:


(i) If the system Tx = f is to be solved for many right-hand
sides (not necessarily all at once), then it may be
advantageous to use our QD-algorithm in Problem 1 to evaluate
the largest eigenvalue $\mu_1 = 1 + \rho$ of the tridiagonal
matrix $T = [\beta_{i-1}, 1, \beta_i]$.


(ii) If T is strongly diagonally dominant, then we may either
take all ω's to be 1, i.e., use the classical Jacobi method, or
estimate ρ by $\|E\|_\infty$.

We show the flow of the computation, equation (9), 
in Fig. 5, where each node performs the computa- 
tion indicated either by step 6 or step 8. The 
quantities €, (1-€), and n, (1-n) are computed 
ahead of the main computation and stored in each 
PE as constants. Since there are four multiplications
and three additions per node, the total time
is t* = 4t + 3t.: Each node requires two compo- 


nents of x and y in addition to three constants 
Bea B. = nk However, the switch in Pi is 
used by PE (K-1) to communicate the other three 
constants (not used in the k-th PE) to the PE 


Hence, for a balanced design we must have 
8b(4t_ +2t tt.) = 1, (see Fig. 5). 


(k+1)° 
3. Processor Design


The processing submodule which satisfies the requirements given
in Section 1 (Fig. 6) consists of an arithmetic unit performing
floating-point division, multiplication, addition, and
subtraction, two register-files with three registers each, and
four queues. The input queues DQ, XQ, and YQ
are register files with a serial input port and a
parallel output port, while the output queue OQ 
has a parallel input port and a serial output 
port. The communication through the switch sub- 
module is asynchronous and bit-serial. The oper- 
ands are always sent or taken from each queue in 
the same order. However, the order in which the 
operands are distributed among three input queues 
is not guaranteed. For example, the operand for 
the XQ may arrive before or after the operand to 
be stored in the YQ. If an input queue is empty, 
the control unit waits until the data arrives. In 
our statically scheduled operation it is not neces- 
sary to associate a validity bit with each operand 
(as in some dataflow machines), since the order is 
determined ahead of time. Two extra bits must be 
added to each data value to distinguish among the 
three input queues. Furthermore, if the data is 
sent to a nearest neighbor, two additional bits 
are needed to set up the switch in the neighboring 
SPM module. Thus, four bits of overhead are necessary
for the communication between nearest
neighbors. 


As an example of processor operation, the dataflow-diagram of a
node from Fig. 5 is shown in Fig. 7(a). There are four
multiplications, two additions, and one subtraction to be
performed. There are three system constants (stored in DQ) and
two variables (stored in XQ and YQ) passing through the switch.
The variable $x_j$ is already in the XQ from the previous
iteration. Each node in the dataflow graph generates one
result, which is stored in the register indicated on the arc
going out of the node. The sequence of microinstructions for
the given dataflow graph is shown in Fig. 7(b). An arithmetic
operation and a move to the OQ are performed in parallel.
Similar microinstruction sequences can be written for other
dataflow diagrams.


Using the present-day technology rate for floating-point
arithmetic ([WaMc82], $t_{/} = 2.5$ µs,
$t_{*} = t_{+} = t_{-} = 0.5$ µs) and communication rate
(50 MB/sec) for a wafer communication, we can compute the ratio
$t_{comm}/t_{arith}$. If the 32-bit floating-point format is
assumed, then

$$ t_{comm}/t_{arith} = (8 \times 36 / 50 \times 10^{6}) / (4t_{*} + 2t_{+} + t_{-}) = 5.44/3.5 = 1.6. $$

We see that our algorithm for problem 3 is communication
intensive. The performance can be improved by doubling the
width of the switch from one bit to two bits. This will, of
course, double the cost of the switch but will result in a more
balanced design, since the ratio $t_{comm}/t_{arith}$ will
become 0.8 for the 32-bit floating-point format.


The algorithms for problems 1 and 2 are arithmetic intensive.
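
The balance estimate is simple arithmetic and can be recomputed as a check. The sketch below assumes that eight values are communicated per node, that the quoted rate is used as 50 × 10^6 bits per second per bit of switch width, as in the formula above, and that a wider switch divides the transfer time proportionally. With 36 bits per value (32 data bits plus four overhead bits) it gives about 1.65, and with 34 bits about 1.55; both round to the quoted 1.6. The names are illustrative.

```python
def comm_arith_ratio(bits_per_value=36, switch_width_bits=1,
                     values_per_node=8, link_rate=50e6,
                     t_mul=0.5e-6, t_add=0.5e-6, t_sub=0.5e-6):
    """Ratio of communication time to arithmetic time for one node of problem 3."""
    t_comm = values_per_node * bits_per_value / (link_rate * switch_width_bits)
    t_arith = 4 * t_mul + 2 * t_add + t_sub
    return t_comm / t_arith


print(round(comm_arith_ratio(), 2))                       # one-bit switch
print(round(comm_arith_ratio(switch_width_bits=2), 2))    # two-bit switch
```

Doubling the switch width halves the ratio, which is the step from roughly 1.6 to roughly 0.8 described in the text.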


4. Conclusion


We have presented in this paper a multiprocessor model for
systolic algorithms. We think it is better suited for
Wafer-Scale-Integration than the systolic-array model. In
particular, in our model the communication is asynchronous and
the model is fault-tolerant, which will improve the yield
during manufacturing and allow graceful degradation during the
operational life of the system.


Since the packaging technology is the bottleneck in system
design, the iterative method in problem 3 compares favorably
with direct methods. While the direct method requires roughly
$5n(t_{*} + t_{+})$, our algorithm needs time
$(1 + [j/k])(n + 2k)(4t_{*} + 2t_{+} + t_{-})$ for j
iterations, where k is the number of SPM modules in the chain.
For strongly diagonally dominant matrices the number of
iterations, even with the ω's taken as unity, is small, and
this iterative algorithm is competitive with the direct method
for large n. For example, if $\|E\|_\infty \le 0.8$ and ω = 1,
then the maximum norm of the error after 90 iterations will be
roughly $10^{-9}$ that of the initial error. Hence, for
n = 1000, one hundred good SPM modules on a wafer will yield a
solution with a reasonable accuracy in roughly 0.8 the time
needed by the direct method; we used the floating-point
arithmetic rates stated above.


The performance is approximately the same since most of the
algorithms are I/O limited and not arithmetic limited. Although
our model takes more silicon than models used for direct
implementations, it offers fault-tolerance, simplicity and
regularity which outweigh, in our opinion, the cost.
Fig. 3. Dataflow diagram for the eigenvalue problem

Fig. 4. Dataflow diagram for roots of polynomials

Fig. 5. Dataflow diagram for tridiagonal solver

Fig. 6. Processor block diagram

Fig. 7. (a) Dataflow diagram for node $y_i$ from Fig. 5;
(b) Microinstruction sequence for processor in Fig. 6:

    R4 ← XQ * DQ,   OQ ← DQ, POP DQ;
    R4 ← R4 - DQ,   OQ ← DQ, POP DQ;
    R6 ← R1 * YQ,   OQ ← XQ, POP XQ, POP YQ;
    R3 ← DQ * XQ,   OQ ← DQ, POP DQ;
    R3 ← R4 + R3;
    R3 ← R2 * R3;
    OQ ← R6 + R3;
INTRODUCTION


This paper discusses some results of a theo- 
retical and experimental study of a general proce- 
dure for implementing recursive and nonrecursive 
signal flow graphs and other similar arithmetic 
algorithms on synchronous digital machines com- 
posed of multiple identical programmable proces- 
sors. A fundamental feature of our approach is 
that it attempts to make maximum use of the Skewed 
Single Instruction Multiple Data (SSIMD) mode [1- 
3] of the synchronous multiprocessor system. When 
operating in this mode, the multiprocessor exe- 
cutes exactly the same program on all of the pro- 
cessors simultaneously, but with a fixed time skew 
imposed between the instruction execution times
on the separate processors. In addition, the
single program utilized in the SSIMD mode is
always exactly a single processor realization of all
the computations associated with a single time 
index of the signal flow graph. Hence, where the 
SSIMD mode of an appropriate single-processor im- 
plementation. 


A primary goal of this research is to develop 
procedures for automatically generating optimal 
multiprocessor signal flow graph implementations 
from a simple, non-parallel representation of the 
algorithms to be implemented. An appropriate al- 
gorithmic representation might be a set of dif- 
ference equations or a matrix presentation [2-4] 
of the signal flow graph. 


In a study of this type, it is very important 
to carefully define the criterion for optimality. 
In this study, three definitions of "optimal" are 
used. An implementation is said to be processor-
optimal if the use of M processors leads exactly
to an M-fold increase in the system's throughput as
compared to a single processor implementation. An
implementation is said to be time-optimal if the
absolute theoretical limit [5] for that signal flow
graph has been achieved for the particular
constituent processor. Finally, an implementation is
said to be absolutely-optimal or just optimal if it
is time-optimal and there exists no other solution
which requires fewer processors.


In these terms, it is now possible to sum- 
marize our results thus far. First, for a very 
large class of recursive signal flow graphs, the 
SSIMD approach results in absolutely-optimal im- 
plementations. This class includes all direct
form digital filters and their transposed forms,
all lattice form digital filters, all parallel or
cascade digital filters based on lattice or direct
forms, and many more. Second, where an absolutely-
optimal SSIMD solution exists, it can be con-
structed automatically. Third, where an absolutely-
optimal SSIMD solution does not exist, a time-
optimal solution can be constructed using a Paral- 
lel Skewed Single Instruction Multiple Data 
(PSSIMD) structure. In a PSSIMD implementation, 
two or more programs coexist in the same imple- 
mentation. The question of whether the PSSIMD 
time-optimal solution is absolutely-optimal is 
difficult to answer in general, but it is clear 
that the PSSIMD solutions obtained in this fashion 
are very efficient. 


THE SSIMD MODE 


The techniques of interest all utilize the
Skewed Single Instruction Multiple Data (SSIMD)
mode of the synchronous multiprocessors to realize
the implementations. In this mode, exactly the
same instruction stream is executed on all the
processors, but a fixed time skew is maintained
between instruction execution times on separate
processors.


The fundamental concept is illustrated by the 
simple example of Figure 1. In this example, the 
second order direct form filter of Figure 1a is
implemented as a single processor program as shown
in Figure 1b. In this single processor realization,
none of the delay elements are realized directly,
but rather the output from each delay
element becomes an input to the program and the 
input to each delay element becomes an output from 
the program. In the SSIMD realization, these de- 
layed values are not computed by this processor, 
but are supplied from identical computations on 
other processors. 


Figure 2 illustrates the fundamental charac- 
ter of an SSIMD solution. In the one processor 
solution of Figure 2a, the same processor which 
generates the output point r(n) is also the pro- 
cessor which has generated all the previous output 
points, r(m) for m < n. Hence these points are
always available when needed. In Figure 2b, a two
processor solution is illustrated. The key point
is that, even though the value of r(n-1) must be
available before r(n) is computed, it is not necessary
for it to be available before the computation
of r(n) is begun. What is required, rather,
is that the value of r(n-1) must be available before
it is used by processor 1. Hence processor 1
may be started as soon as it is guaranteed that
r(n-1) will be available from processor 2 before
it is needed by processor 1.


Figure 3 shows the diagram for a one processor,
a two processor, and a five processor realization
for the signal flow graph of Figure 1. In
the single processor solution of Figure 3a, all of 
the past values of r(n) are supplied by the same 
processor, and there is never any issue of data 
availability. In the two processor realization of 
Figure 3b, alternate points are supplied by each 
processor, and the two processors must be skewed 
such that the data requirements of each are always
met by the other. Likewise, Figure 3c shows a five
processor solution in which every fifth set of
points is supplied by each of the five processors.
It should be noted that all these SSIMD solutions 
are "free running" such that whenever a processor 
completes the computations associated with one 
time index, it immediately begins the computations 
associated with another time index. Hence, each 
program realizes an infinite loop (one time index 
per loop) and, under the assumption that the pro- 
gram timings are not data dependent, each loop 
takes exactly the same amount of time. Thus, if the M
processors are started at M starting times, $t_m$,
$0 \le m \le M-1$, then the relative time skew so
imposed remains fixed until the program is halted
externally. Hence, the problem of implementing a
particular recursive and iterative arithmetic program
reduces to specifying the M starting times,
$t_0, \ldots, t_{M-1}$, such that all the data necessary in
the various computations is available before it is
needed.
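
The data-availability condition above can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: processor m is assumed to handle the time indices n with n mod M = m, each loop is assumed to take T time units, r(n) is assumed to become available R units after the loop for index n starts, and r(n-l) is assumed to be read I(l) units after the loop starts. The function names and the numerical program are hypothetical.

```python
def loop_start(n, starts, T):
    """Absolute start time of the loop that computes time index n.

    Assumes processor m = n mod M handles index n and runs freely,
    so its k-th loop starts at starts[m] + k * T.
    """
    M = len(starts)
    return starts[n % M] + (n // M) * T


def skew_is_legal(starts, T, R, I, n_check=64):
    """Return True if every recursive input is ready before it is read.

    starts : starting times t_0 .. t_{M-1} of the M processors
    T      : loop length
    R      : time within a loop at which the recursive output is ready
    I      : dict mapping a delay l to the time within a loop at which
             the output delayed by l is read
    """
    max_delay = max(I)
    for n in range(max_delay, n_check):
        for delay, read_time in I.items():
            produced = loop_start(n - delay, starts, T) + R
            consumed = loop_start(n, starts, T) + read_time
            if produced > consumed:
                return False
    return True


# Hypothetical program: T = 12, R = 10, I(1) = 4, I(2) = 6.
program = dict(T=12, R=10, I={1: 4, 2: 6})
print(skew_is_legal([0], **program))        # one processor
print(skew_is_legal([0, 6], **program))     # two processors, uniform skew
print(skew_is_legal([0, 4, 8], **program))  # three processors, uniform skew
```

For this assumed program the one and two processor schedules pass the check, while the three processor schedule fails because r(n-1) would be read before it is produced.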


Fixed Program Implementations 


The problem of implementing a particular re- 
cursive signal flow graph in SSIMD mode can be 
divided into two related problems. The first pro- 
blem is that of finding and characterizing all 
legal SSIMD solutions for a particular single pro- 
cessor program realization of the signal flow 
graph. The second problem is that of constructing 
the particular single processor program so that 
the eventual SSIMD solution will be optimal. This 
section addresses the first problem. 


In fitting the program together in the SSIMD
mode, the only information necessary concerns the
length of the program, the times at which recursive
inputs are needed, and the times at which
recursive outputs are available. Hence, a program
with a single recursive output such as that of
Figure 1 can be characterized as

$$K(I(1), I(2), \ldots, I(L);\ R;\ T) \tag{1}$$

where K is the task identifier, T is the task
length, R is the output time for the recursive
output, $I(\ell)$ is the input time for the $\ell$th delayed
recursive output, and L is the value of the longest
delay. The important theoretical results for
this environment can be summarized as follows [3-4]:


1) All SSIMD M-processor solutions are bounded by
the solution in which the processors are started
at equal intervals and the outputs are periodic.
For such a solution, the time between outputs is
T/M and

$$t_m = \frac{mT}{M}, \qquad 0 \le m \le M-1 \tag{2}$$

Stated another way, if an M-processor (processor-
optimal) solution exists, it can be implemented
with the above starting times.


2) The maximum number of processors which can be
used in such a solution, $M_x$, is given by

$$M_x = \mathrm{INT}[M(\ell_x)] \tag{3a}$$

$$M(\ell) = \frac{\ell\, T}{R - I(\ell)} \tag{3b}$$

where $M(\ell)$ is the non-integer number of processors
which could be utilized if the only constraint
came from the recursive input $I(\ell)$, $\ell_x$ is the
value of $\ell$ for which $M(\ell)$ is minimum, and INT[·]
means "the integer part".


3) Any SSIMD implementation for the given program
can be obtained with uniform time skews as shown
in (1) above so long as $M \le M_x$.


4) The greatest throughput which is achievable by
these techniques is obtained with a time skew of

$$t_x = \frac{R - I(\ell_x)}{\ell_x} \tag{4}$$

This solution is generally achieved with $M_x + 1$
processors by adding extra non-functional delays
to the program, and although time-optimal, is generally
not processor-optimal. A solution which is
both time-optimal and processor-optimal occurs
for $M_x$ processors only for the unlikely case of
$M_x = M(\ell_x)$.


5) Time-optimal solutions are available for $M_x + 1$
processors, and the addition of more than one
processor will never increase the throughput beyond
a sample rate of $1/t_x$.
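
The quantities in results 1) to 5) can be computed directly from the program characterization. The Python sketch below is a hedged illustration, not the authors' procedure: it assumes integer timings, assumes that a delayed input imposes the constraint M(l) = l*T / (R - I(l)) (derived from requiring r(n-l) to be ready before it is read under uniform skew), and ignores inputs with I(l) >= R since they impose no constraint under that assumption. The function names and the example program are hypothetical.

```python
from fractions import Fraction


def ssimd_limits(T, R, I):
    """Return (M_of, l_x, M_x, t_x) for a program K(I(1..L); R; T).

    Assumption (not taken verbatim from the text): each delayed input l
    with I(l) < R limits the processor count to M(l) = l*T / (R - I(l)).
    """
    M_of = {l: Fraction(l * T, R - i) for l, i in I.items() if i < R}
    l_x = min(M_of, key=M_of.get)      # the limiting delayed input
    M_x = int(M_of[l_x])               # INT[.]: the integer part
    t_x = Fraction(R - I[l_x], l_x)    # smallest usable time skew
    return M_of, l_x, M_x, t_x


def uniform_starts(T, M):
    """Equally spaced starting times t_m = m*T/M for m = 0 .. M-1."""
    return [Fraction(m * T, M) for m in range(M)]


# Hypothetical program: T = 12, R = 10, I(1) = 5, I(2) = 6.
M_of, l_x, M_x, t_x = ssimd_limits(12, 10, {1: 5, 2: 6})
print(M_of, l_x, M_x, t_x)
print(uniform_starts(12, M_x))
```

In this assumed example M(1) = 12/5 is the limiting value, so two processors give an output every 6 time units, while the skew bound is 5 time units. Reaching that bound would take a third processor and a program padded to length 15, which mirrors the distinction drawn in result 4) between time-optimal and processor-optimal solutions.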


Based on these results, three important fea- 
tures should be noted. First, given a single pro- 
cessor program for a signal flow graph or other 
algorithm describable as in equation (1), the max- 
imum number of processors which can be used is 
immediately available (eq. 3b) and the starting 
times in the SSIMD solutions are trivially simple 
to compute (eq. 2). Hence, for a given program 
the SSIMD implementation procedure is very simple. 
Second, and more important, the maximum number of
processors which can be utilized (eq. 3) is a function
of only a single input time, $I(\ell_x)$. Hence,
a simple constraint exists for optimizing a particular
program for an SSIMD implementation. This
program is obtained by maximizing the minimum
value of $M(\ell)$, namely $M(\ell_x)$. Finally, and perhaps
most important, the optimum time skew, $t_x$, is a
function of neither the program duration nor the
number of recursive inputs or outputs for the program. This
allows for several important generalizations to be 
made, and, for properly written programs, leads to 
very impressive solutions. For example, the 
system of Figure 1 can typically be implemented 
with 8 or 9 processors, even though it has only 
two recursive inputs. 


OPTIMAL SIGNAL FLOW GRAPH IMPLEMENTATIONS


Based on the results summarized in the pre- 
vious section, it is clear that any program which 
implements a signal flow graph can be the basis 
for an SSIMD solution. In regard to the optimal- 
ity of such programs, three separate issues must 
be addressed. First, how is the maximum through- 
put, the rate which defines a time-optimal solu- 
tion, determined for any signal flow graph? 
Second, how is the question of the existence of a 
time-optimal SSIMD solution to be addressed, and 
how is a time-optimal SSIMD solution constructed 
if it does exist? Finally, if no time-optimal 
SSIMD solution exists, how can a time-optimal 
PSSIMD solution be constructed? 


The first question has been addressed in a 
paper by Renfors and Neuvo [5], in which they show 
how to determine the maximally attainable through- 
put for any signal flow graph given the arithmetic 
constraint of the processor. The procedure can be 
summarized as follows. The first step is to expand
the signal flow graph node structure so that
all arithmetic operations occur as individual
branches (see Figure 4). Note that this procedure
deterministically sets the precedence relations
among all arithmetic operations. The second step
is to measure the arithmetic delays and count the
delay elements ($z^{-1}$) associated with each
loop. The minimum sampling period (and hence the
maximum throughput) for such a system is given by

$$T_0 = \max_{\text{all loops } \ell} \frac{T_\ell}{N_\ell} \tag{5}$$

where $T_\ell$ is the total arithmetic delay in the $\ell$th
loop and $N_\ell$ is the total (integer) number of delay
elements in the $\ell$th loop. For an implementation
to be time-optimal it must attain this limit.
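
The loop bound is simple to evaluate once the loops have been tabulated. The following Python sketch assumes the bound is the maximum over all loops of the loop's arithmetic delay divided by its delay-element count. The three loops and the delay values d_m and d_a (multiply and add delays) are hypothetical numbers chosen for illustration.

```python
from fractions import Fraction


def min_sampling_period(loops):
    """Return the largest ratio T_l / N_l over all loops.

    loops : iterable of (T_l, N_l) pairs, where T_l is the total
            arithmetic delay of a loop and N_l its number of delays.
    """
    return max(Fraction(t, n) for t, n in loops)


# Assumed unit delays for a multiply and an add (illustrative only).
d_m, d_a = 2, 1

# Three hypothetical loops: (arithmetic delay, number of delay elements).
loops = [
    (2 * d_m + 3 * d_a, 1),
    (d_m + d_a, 1),
    (d_m + 2 * d_a, 2),
]
T_0 = min_sampling_period(loops)
print(T_0)
```

With these assumed values the first loop is the limiting one, so no implementation on such processors could produce outputs more often than once every 2*d_m + 3*d_a time units.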


Using the above result, it is a straightforward
process to construct all the time-optimal
SSIMD solutions which exist, and the best SSIMD
solution if no time-optimal solution exists. All
that is required is to first construct all possible
signal flow graphs of the type of Figure 4
from the original signal flow graph. This is accomplished
by systematically expanding each node
into all possible sets of nodes each of which involves
only one multiply and one add. Then, for
each of these expanded signal flow graphs, all
possible arithmetic orderings can be enumerated in
a directed search using the intrinsic precedence
relations [6]. Each of these orderings constitutes
a program for realizing a single processor
implementation, and the maximum throughput for
each is computable from eq. (4). The program(s)
with the minimum value of $t_x$ are the best SSIMD
solution, and if $t_x = T_0$, then that solution is
time-optimal. A key point in this regard is that
if a time-optimal SSIMD solution exists, then it
is also absolutely-optimal. This is because the
SSIMD solution uses all the cycles of all the processors
on the algorithm, which is the best that
can ever be achieved.


It is possible to structure the searching
procedure described above so as to simultaneously
construct PSSIMD solutions if no time-optimal
SSIMD solution exists. The procedure begins by
removing all the delay elements from the signal
flow graph as shown in Figure 1b. Then all of the
loops in the system are tabulated as follows.
First, all first order loops are found by replacing
each delay element and tabulating any resulting
loops. Second, all second order loops are
found by replacing all delay elements in sets of
two, and tabulating all loops not previously
found. This process is repeated for increasing
numbers of delay elements until all loops are tabulated.
One of the loops tabulated in this procedure
must be the limiting loop of eq. (5), and
this loop must occur as a set of contiguous arithmetic
operations in any SSIMD solution. Hence,
the optimal program construction task reduces to
ordering the remaining operations so as not to
violate equation (5).


For many classes of recursive systems, such 
as direct form, cascade direct form, and parallel 
direct form filters, the individual loops do not 
"overlap" or, in other words, they do not share 
arithmetic operations. For such systems, each loop
can be implemented separately in several allowable
orders and several absolutely-optimal SSIMD solutions
exist. For other systems, such as the lattice
filter of Figure 4, the loops overlap, but it
is possible to order the remaining operations so
as not to violate equation (5). For still other
systems, such as the coupled form, no time-optimal
SSIMD solution exists. In that case, a time-optimal
PSSIMD solution can be constructed by systematically
"offloading" the loop overlap operations
to other processors. These secondary "slave" processors
are synchronously locked to the "master"
processor, which is defined as the processor which
implements the limiting loop. This PSSIMD construction
can always be used to construct a time-
optimal solution, but the question of its absolute
optimality is harder to address. It is
clear, however, that the slave processors still
automatically benefit from the same SSIMD gains as
the master processor. This means, for example,
that if the total arithmetic processing time for
the master processor is $T_M$ and the total arithmetic
processing time for a slave processor is
$T_M/N$, then N master processors can be serviced by
one slave.


DISCUSSION 


All of the above results are really a reflection
of the intrinsic data flow constraints inherent
in recursive algorithms, and they are obtained
by mixing three sets of constraints: the
fundamental recursive constraints of the algorithm;
the simple, highly structured constraints of
the signal flow graphs; and the constraints imposed
by SSIMD realizations. There are several important
points here. The first is that if an SSIMD
mode exists in a multiprocessor, then there are no
better ways for implementing many digital filters.
This is made even more attractive by the fact
that the SSIMD implementations are generally much
simpler than other multiprocessor options which
typically involve the parsing of the signal flow
graph to exploit the local parallelism. The
second important point is that all the limits on
the number of processors and the throughput ($t_x$)
are a reflection of the recursive nature of the
algorithms. If the programs are not recursive,
then the programs can be implemented such that
there are no such constraints, and the number of
usable processors goes to infinity. What this
means, clearly, is that the solution is no longer
constrained by the algorithm but rather by the
nature of the I/O hardware. The key point here is
that the SSIMD solution for non-recursive programs
is processor-optimal for any number of processors,
and these solutions are even simpler to implement
than the solutions for the recursive case.


The largest potential problem in SSIMD solutions
concerns the inter-processor communication
issues. Since the entire SSIMD development has
been done under the assumption that the processors
could communicate "at will", this would at first
seem like a critical issue. It turns out, however,
that it is not. This is true for two reasons.
First, the fundamental periodicity of the
SSIMD solution makes the communications requirements
very uniform, which avoids many potential
time conflicts. Second, and most important, the
nature of the communications environment can be
systematically controlled. To see this, one simply
needs to note that the number of processors with
which a particular processor must communicate is
controlled by the maximum length of its delay elements
(see Figures 1, 2, and 3). The use of long
delay chains does improve the final solution since
it leads to SSIMD realizations which require fewer
processors to realize a time-optimal solution.
But the entire procedure still works if the maximum
delay length is constrained to be one. Indeed,
this is true in the classical formulation
for the signal flow graph [4]. For such realizations,
each processor only talks to its two nearest
neighbors, and the communication is always in
one direction. Such realizations have the same
maximum throughput rate, but, in general, require
a few more processors to attain it. Most important,
however, they have a communication environment
which is always trivially implementable.

$y(n) = b_0 r(n) + b_1 r(n-1) + b_2 r(n-2)$

$r(n) = a_1 r(n-1) + a_2 r(n-2) + x(n)$


Figure 1a: Signal flow graph for a
2nd order recursive direct
form II digital filter.


Figure 1b: Single processor realization for
the signal flow graph shown above.
None of the delays are implemented by
the program; they are realized
by the parallel structure.


Figure 2a: Single processor implementation of 
the signal flow graph of Fig. 1. 
All points in the output stream 
and all values of r(n) are computed 
by the same processor. 


Figure 2b: Two processor implementation of the 
signal flow graph of Fig. 1. The 
computation of r(n) by processor 1 
can be started as soon as processor 
2 has been running long enough to 
guarantee r(n-1) will be available 
when needed by processor 1. 


Figure 3a: In a single processor SSIMD realization,
all recursive outputs are
supplied by the same processor.


Figure 3b: In a two processor SSIMD realization,
alternate recursive outputs are
supplied by each processor.


Figure 3c: In a five processor SSIMD realization, every fifth time
index is computed by each processor. The processors
are skewed in time so as to guarantee all recursive
inputs are available when needed.

LOOP    LOOP         LOOP              ARITHMETIC      T_l/N_l
INDEX   SEQUENCE     OPERATIONS        DELAY
1       1-2-4-8      1*/1+/2+/4*/4+    2d_m + 3d_a     2d_m + 3d_a
2       2-9          2*/2+             d_m + d_a       d_m + d_a
3       1-2-9-4-8    1*/1+/4+          d_m + 2d_a      d_m/2 + d_a


MINIMUM SAMPLING PERIOD = T_1 = 2d_m + 3d_a = T_0


OPTIMUM SSIMD SOLUTIONS   2*/1*/1+/2+/4*/4+ ... other operations ...

Figure 4: Example of the derivation of an optimal SSIMD program
for a 2nd order lattice filter. Each node involves one
multiply, "n*", and one add, "n+", where n is the node
number. The loop tabulation gives the minimum sampling
period, T_0. The program has two delay outputs, R_1 and
R_2, and two delay inputs, I_1(1) and I_2(1). The program
construction procedure gives the ordering indicated, which
gives a value of t_x = T_0. Thus, the SSIMD solution is
absolutely-optimal. Note: The storage and I/O operations
have been left out of this analysis for simplicity. They
can easily be included in the analysis.


A TEST STRATEGY FOR PACKET SWITCHING NETWORKS


Willie Y-P. Lim 

Abstract -- A test strategy for packet switching networks
is described. The effect of a single stuck-at fault is either
misdirected packets, missing packets, corrupted data in packets, or
multiple packets. A fault can either prevent packet transmission
or affect the integrity of the data sent in the packet, and it is
detected as one of 4 cases -- both output ports of the switching
element inaccessible to an input port, an output port inaccessible
to an input port, an input port permanently connected to an
output port, and erroneous packet length.


Introduction


Packet communication architecture has been discussed in
the context of implementing data-flow machines [1]. Such
systems use packet switching networks for inter-processor
connection. In [1] for example, the network used is composed of
packet switching elements called 2x2 routers. Packet switching
for another class of networks is discussed in [4]. Each packet is
routed through the network using the information carried in the
packet. Due to this distribution of the switching function, many
packets can be simultaneously transmitted through the various
stages of the network. When asynchronous or self-timed
communication protocols are used, the testing of such networks
requires new approaches. A strategy for testing such networks is
described in this paper.

Fault diagnosis of networks has been studied by [8] for
on-line fault diagnosis and by Wu and Feng [7]. The work in [7]
dealt mainly with the fault diagnosis of networks in which the
switching elements have single bit inputs. Wider inputs are used
in packet communication. Furthermore the packet format and
communication protocol used affect the test and fault diagnosis
strategy.


Packet Format and Packet Communication Protocol 


A packet is a sequence of bits and is usually transmitted as
a sequence of sub-units with each sub-unit being some fixed
number of bits. For convenience, a sub-unit is referred to as a
byte. The number of bits in a byte is usually determined by chip
pin-out and communication bandwidth considerations. The
information contained in a packet is composed of the destination
address, the data to be sent and the length of the packet. Since
only packet switched networks are considered in this paper, the
destination address is necessary for routing the packets through
the network. The packet length information can be included in the
data transmitted, or an extra bit can be used to indicate which
byte is the last one in the packet.

Packets are assumed to be transmitted using asynchronous
communication protocols. The transmission of each packet or byte
is accompanied by an event signalling the arrival of the packet or
byte at the destination and each successful receipt of a packet
must be acknowledged by the explicit sending of a control signal.
For example, a special signal may be used to indicate the arrival
of the packet, or the arrival event may be encoded in the data
signal lines as in the "dual-rail" communication protocol [2].

A Packet Switching Network

The switching element in the network is a 2x2 router which
receives packets at its two input ports and sends them out at its
two output ports. The least significant bit of the address byte of
the packet is used for selecting the output port for sending the
packet. Output ports can be independently selected by the input
ports and, if there is contention for an output port, only one of
the input ports is connected while the other waits until the output
port becomes free, i.e. the input is temporarily blocked. If there
is no contention for an output port, then the packet transmission
from an input port to an output port can proceed in parallel with
a non-conflicting one. The various input-output port
configurations possible are shown in Figure 1. The least
significant bit of the destination address byte having a value of 0
will cause output port 0 to be selected while output port 1 will be
selected if that bit is 1.
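
The port selection and blocking behaviour described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the arbitration rule (the lower-numbered input port wins a contended output port) is an assumption, since the text does not say which input is connected first.

```python
def select_output(address_byte):
    """Output port chosen by the least significant address bit."""
    return address_byte & 1


def route(requests, busy=frozenset()):
    """Resolve one round of requests at a 2x2 router.

    requests : dict mapping an input port (0 or 1) to the address byte
               of the packet waiting at that port
    busy     : output ports that are already in use
    Returns (connections, blocked), where connections maps input port
    to output port and blocked lists the inputs that must wait.
    Assumed arbitration: the lower-numbered input port wins.
    """
    connections = {}
    blocked = []
    taken = set(busy)
    for in_port in sorted(requests):
        out_port = select_output(requests[in_port])
        if out_port in taken:
            blocked.append(in_port)      # temporarily blocked input
        else:
            connections[in_port] = out_port
            taken.add(out_port)
    return connections, blocked


print(route({0: 0b10, 1: 0b11}))  # no contention: straight through
print(route({0: 0b01, 1: 0b00}))  # no contention: crossed
print(route({0: 0b100, 1: 0b110}))  # both request output port 0
```

The first two calls correspond to the non-blocking straight and crossed settings, and the third shows one input being held until output port 0 becomes free.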

The packet switching network has the same interconnection
structure as the baseline network [5], [6]. Figure 2 shows the
structure of a 16x16 network. Each router in the network is
connected to another router or a processor through links. For a
network with N input ports and N output ports, there are $\log_2 N$
stages of routers and $1 + \log_2 N$ levels of links. The ports of the
routers in each stage are numbered from top to bottom starting
with 0 at the top. These numbers are not shown in Figure 2.
Instead the destination addresses of the output ports of the last
stage of the network are shown. If $P_s P_{s-1} \ldots P_1 P_0$ with
$s = (\log_2 N) - 1$ is the bit representation of the port number, then
the router number in that stage is given by the value of the bit
string $P_s P_{s-1} \ldots P_1$, i.e. by dropping the least significant bit of
the port number. With this network structure, destination address
bytes of the same value will route packets to the same output
port of the network. The number of output ports that can be
addressed, i.e. the network size, is fixed by the number of bits in
the destination address byte. Port and router numbers are
important for identifying ports and routers within a given stage
during testing or fault diagnosis. In this paper, the network is
assumed to be for connecting N processors, where N (> 0) is some
power of 2.
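
The numbering conventions above translate directly into code. The short Python sketch below uses hypothetical function names and simply restates the counts of stages and link levels and the rule for obtaining a router number from a port number.

```python
def network_shape(N):
    """Return (stages, link_levels) for an N x N network of 2x2 routers.

    N must be a positive power of 2, as assumed in the text.
    """
    if N < 1 or N & (N - 1):
        raise ValueError("N must be a positive power of 2")
    stages = N.bit_length() - 1        # log2(N) stages of routers
    return stages, stages + 1          # 1 + log2(N) levels of links


def router_number(port):
    """Router number within a stage: drop the least significant bit."""
    return port >> 1


print(network_shape(16))
print(router_number(0b1011))
```

For the 16x16 network this gives four stages of routers and five levels of links, and port 1011 belongs to router 101 of its stage.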

[Figure 1, panels (a) through (f): configurations with no blocking and configurations with blocking.]


Figure 1. Port Configurations of the 2x2 Router

Figure 2. A 16x16 Network of 2x2 Routers


Fault Model 


One or more of the following effects will be produced by
every single stuck-at fault occurring in a link or router of the
network:

1) misdirected packets,
2) missing packets,
3) corrupted data in packets, and
4) multiple packets being received.

The types of faults occurring in the network can be divided 
into two classes. The first of these is the class of faults that 
affects the asynchronous communication protocol. Examples of 
faults in this class include the packet acknowledge or packet 
control signals being stuck at one of the logical values or a 
switching element being stuck in some erroneous state due to a 
fault occurring inside it. The effect of this class of faults is 
missing packets, i.e. no packet is received when one or more is 
expected. This occurs when the packets fail to arrive within some 
specified time which is larger than the normal packet transmission 
time. The packets are held up somewhere in the network due to 
faults. The other class of faults affects the integrity or 
interpretation of data in a packet. A stuck-at fault occurring in a 
link, for example, can cause packets to be misdirected due to an 
erroneous destination address bit being used. Or, the occurrence 
of an internal fault in a switching element can cause the address 
bit to be interpreted wrongly. If the full address space available 
is not used, a fault in a link need not necessarily cause packets to 
be misdirected. We may get instead, erroneous data in the 
received packet. 

Figure 3 shows the various faulty router configurations. 
The dashed lines indicate the connections that are operable while 
the dark lines indicate the connection being permanently fixed. 
Case (a) in the figure is for faults that prevent packet
transmission through an input port while case (b) is for faults that 
prevent packet transmission through an input-output port pair. 
The third case is for faults that cause an input port to be 
permanently connected to an output port. Note that a connection 
is said to be good if packets can be sent through it using the 
asynchronous communication protocol. Hence the case of
corrupted data in a received packet at the proper destination is
not shown. Neither is the case where the faulty router sends
packets to two output ports considered. This is because in order
that the router be able to send packets to two output ports, the
fault must make it behave like a fork in sending the packet arrival
signal out and like a merge in receiving the packet acknowledge
signals. We assume that the design of the router is such that this
will cause packet transmission to hang.


(a) Both output ports inaccessible to an input port


(b) One output port inaccessible to an input port


(c) One output port permanently connected to an input port


Figure 3. Faulty Configurations of the 2x2 Router


Test Strategy 


The test strategy proposed here involves checking that 
packets can be sent through the input ports of every router in 
the network. To do this, the two test phase approach discussed 
in [7] is used. Since each phase involves a single test as each 
router is tested by the transmission of a single packet through it, 
we call the two phases tests 1 and 2. In test 1, each router input
port is checked to see if it can be connected directly to the
corresponding output port -- input port 0 to output port 0 and
input port 1 to output port 1. All routers are set up as in (e) of Figure
1 by using the proper destination addresses. Test 2 checks to
see if each input port can be connected to the output port across
from it -- input port 0 to output port 1 and input port 1 to output
port 0. This means that all routers are set up as in (f) of Figure 1.
Note that in both cases packet transmission through each input 
port of the router is independent of each other. 

In each test, exactly one specially formatted packet is sent
from a source to a destination and there are exactly N such
source-destination pairs that will be communicating concurrently.
Hence if the network is working properly, each processor will
send and receive exactly one packet. The format of the packet
used depends to some extent on the router implementation. In
any case, the source address is also included in the packet. The
source address in the received packet is checked to make sure
that the packet received is sent by the expected source. Some
test bit patterns are also sent to check for stuck-at faults in the
data bits. This test pattern is composed of two bytes, the first of
which is an alternating sequence of 0's and 1's, while the second
is the same sequence rotated by 1. The width of the test pattern
is the same as the width of the data path of the byte serial
transmission used. In those implementations where the Last Byte
bit is used, the length of the packet is also included to check for
stuck-at faults in that bit. If the packet length is fixed during the
test, the length information need not be sent as data in the
packet.
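
The two-byte test pattern can be generated as follows. This Python sketch is illustrative: the function names are hypothetical, the first byte is assumed to have a 1 in its least significant position, and an even data-path width is assumed so that the rotated byte is the exact complement of the first.

```python
def alternating_pattern(width):
    """Return the two test bytes for a data path of `width` bits.

    The first byte alternates 0's and 1's (assumed to end in a 1 at
    the least significant bit); the second is the first rotated left
    by one position. An even width is assumed.
    """
    if width < 2 or width % 2:
        raise ValueError("sketch assumes an even width of at least 2 bits")
    mask = (1 << width) - 1
    first = 0
    for bit in range(0, width, 2):
        first |= 1 << bit
    second = ((first << 1) | (first >> (width - 1))) & mask
    return first, second


def exercises_every_line(first, second, width):
    """True if every data line carries both a 0 and a 1 across the pair."""
    return (first ^ second) == (1 << width) - 1


first, second = alternating_pattern(8)
print(format(first, "08b"), format(second, "08b"))
print(exercises_every_line(first, second, 8))
```

Because the two bytes are complementary under these assumptions, a data line stuck at either logical value corrupts at least one of them.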

With this test strategy, if case (a) of Figure 3 occurs, the 
effect will be two missing packets - one for each test. In this 
case a stuck-at fault occurring in the attached input link of the 
router cannot be distinguished from one that occurs inside the 
router. Case (b) will have the effect of a missing packet in one of
the tests. In case (c), the effect will be a missing packet and 
more than one packet received by a destination in one test. If a 
fault occurs that causes packets of the wrong lengths to be sent, 
the destination will see a shortened packet and it as well as some 
other destinations may receive additional packets. 
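
The mapping from observed effects to fault cases can be summarized as a small decision procedure. The Python sketch below is one possible reading of the paragraph above, not the paper's own procedure: the effect labels ("missing", "multiple", "short") and the order in which the cases are checked are assumptions.

```python
def classify(test1, test2):
    """Map the effects seen in tests 1 and 2 to a fault case.

    Each argument is a collection of effect labels observed in that
    test: "missing", "multiple" or "short" (hypothetical labels).
    """
    observed = (set(test1), set(test2))
    if any("short" in effects for effects in observed):
        return "erroneous packet length"
    if any({"missing", "multiple"} <= effects for effects in observed):
        return "(c) input port permanently connected to an output port"
    missing = ["missing" in effects for effects in observed]
    if all(missing):
        return "(a) both output ports inaccessible to an input port"
    if any(missing):
        return "(b) one output port inaccessible to an input port"
    return "no fault detected"


print(classify({"missing"}, {"missing"}))
print(classify(set(), {"missing"}))
print(classify({"missing", "multiple"}, set()))
```

The three calls reproduce cases (a), (b) and (c) in that order.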


Fault Diagnosis


If a fault is detected in the test, the fault diagnosis strategy
described in [7] is used to identify the faulty router. However, it
is important to note that the strategy given in [7] deals only with
single bit input lines, while in this paper we are dealing with
multiple signal lines carrying bytes observing some asynchronous
communication protocol. Hence, instead of getting faulty output
patterns, we get one of the effects described earlier. For
example, the logically unidentified output value (open circuit) "-"
and logically erroneous value (two independent logic signals being
tied together) "¢" correspond to the effects of missing packets
and multiple packets, respectively.


Both Output Ports Inaccessible to an Input Port 


Since in this case a fault occurring in a link cannot be 
distinguished from one that occurs in the connected router, the 
fault is assumed to be in a link. Once the link is located, further 
tests are done to locate the actual fault. Since a link is on 
exactly one path for each test, the set of links that are on the 
faulty path can be identified as follows. Each link is identified by 
the number of the input port that it is connected to. In test 1, if 
$p_{s-1} \ldots p_1 p_0$ is the link that is connected to the source 
processor, then the link at the output side of the i-th stage is 
identified by rotating, to the right by 1, the rightmost 
$s-i+1$ bits of the link number of the previous stage, while in test 2 
the process is the same except that the least significant bit of the 
$s-i+1$ bits is always complemented after the rotation. To identify 
the "faulty" link, the source addresses for the two tests are 
obtained from the destination addresses of the processors that 
did not receive a packet. Note that the source-destination 
addresses are related as follows: for test 1, the addresses are bit 
reversals of each other, and for test 2 they are the complement of 
the bit reversals of each other. The set of links of the path is 
determined for each test. The "faulty" link is the intersection of 
the two link sets. Two tests are required for locating the "faulty" 
link, and one more test is necessary to determine whether the fault 
is in the link or the router. This test involves checking whether 
packet arrivals and packet acknowledgments can be detected at 
the input port of the router. The absence of the former means 
that the link is bad, and the absence of the latter means that the 
router is bad. 
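
The source-destination relation stated above can be sketched in a few lines of Python. This is only an illustration: the function names and the assumption that addresses are plain s-bit integers are ours, not the paper's.

```python
def bit_reverse(address: int, bits: int) -> int:
    """Reverse the order of the lowest `bits` bits of `address`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (address & 1)
        address >>= 1
    return result


def source_for_destination(destination: int, bits: int, test: int) -> int:
    """Source address paired with `destination` in test 1 or test 2.

    Test 1: the addresses are bit reversals of each other.
    Test 2: the addresses are complements of the bit reversals.
    """
    reversed_address = bit_reverse(destination, bits)
    if test == 1:
        return reversed_address
    if test == 2:
        return reversed_address ^ ((1 << bits) - 1)  # complement all bits
    raise ValueError("test must be 1 or 2")
```

Given the destination that failed to receive a packet, each call yields the source whose path must contain the faulty link for that test.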


An Output Port Inaccessible to an Input Port 


For this case, there is only one destination that receives no 
packets for both tests. To locate the faulty router, a binary 
search is done. The objective of the search is to identify the 
stage in which the router is located. Knowing the path and the 
stage, the router can be pinpointed. A search tree with each of 
the stages of the network as leaves is generated. Starting at 
the root, roughly the first half of the stages (the split point 
depends on whether s is even or odd) are placed in the left 
subtree and the rest of the stages in the right subtree. 
The left subtree is set up to be of the same configuration as the 
test in which the fault occurs, while the right subtree is set up in 
the same configuration as the other test. The network is then 
tested. If no faulty response is obtained then the fault is in the 
right subtree; otherwise it is in the left subtree. This process is 
repeated for the faulty subtree until the stage is located. The 
number of tests required is of the order log(log N). 
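
The search can be sketched as an ordinary binary search over stage indices. The callback `left_half_is_faulty` is a hypothetical stand-in for "configure stages lo..mid as in the failing test, the rest as in the other test, and report whether a faulty response appears"; the paper does not define such an interface.

```python
import math
from typing import Callable, Tuple


def locate_faulty_stage(
    num_stages: int,
    left_half_is_faulty: Callable[[int, int], bool],
) -> Tuple[int, int]:
    """Return (faulty stage index, number of tests used)."""
    lo, hi = 0, num_stages - 1
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if left_half_is_faulty(lo, mid):
            hi = mid          # fault lies in stages lo..mid
        else:
            lo = mid + 1      # fault lies in stages mid+1..hi
    return lo, tests


# Simulated network with 16 stages (N = 2**16) and a fault in stage 11.
stage, used = locate_faulty_stage(16, lambda lo, mid: lo <= 11 <= mid)
assert stage == 11 and used <= math.ceil(math.log2(16))
```

Since the number of stages is about log N, halving the candidate range at each test gives the log(log N) test count quoted above.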


An Output Port Permanently Connected to an Input Port 


In this case one of the tests will give two faulty responses 
- a missing packet and multiple packets at two distinct destinations. 
From the test at which the fault occurs, the fault type can be 
determined - for test 1, the left two cases of (c) in Figure 3 and 
for test 2 the other two cases. At most 2 tests are required to 
locate the router. 


Erroneous Packet Length 


For this case, the destination will receive a shorter than 
expected packet. Since the fault may occur in a link or a router, 
depending on whether the Last Byte bit is used or not, the 
situation is similar to that of both output ports being inaccessible 
to an input port. A fault has the effect of sending fragments of 
the packets through the network. More than one destination may 
receive multiple packets; all but one of these will receive a normal 
packet followed by at least one erroneous packet. The remaining 
one destination will receive one shortened packet followed 
possibly by some erroneous packets. The faulty path is identified 
by the latter since it is the proper destination and is guaranteed 
to receive at least the destination address byte of the packet. 
Each test will give a faulty path and the intersection of the set of 
links or routers in the two paths is the faulty link or router. 


Summary 


A test strategy for packet switching networks has been 
presented. The strategy is developed for byte serial packet 
communication using an asynchronous communication protocol. It 
has been shown that the effect of a single stuck-at fault can be 
classified into misdirected packets, missing packets, corrupted 
data in packets, or multiple packets. There are basically two 
types of faults - those that prevent packet transmission and those 
that affect only the integrity or the interpretation of the data 
sent in the packet. The presence of a fault in the network will 
show up as one of 4 cases - both output ports of the switching 
element not accessible to an input port, an output port not 
accessible to an input port, an input port permanently connected 
to an output port, and erroneous packet length. An approach for 
fault location is also presented and it is shown that the number of 
tests required is either constant or of the order of log (log N). 


Abstract -- It was shown previously that 
four tests are required in order to detect single 
faults and to locate single link stuck faults for 
a class of multistage interconnection networks. 
In this paper we show that only three tests are 
actually necessary and sufficient both to detect 
single faults and to locate single link stuck 
faults. The test schemes described achieve the 
least number of tests required for detecting and 
locating such faults. 


Introduction 


In a paper previously presented at this conference [1] 
it was shown that four tests are required 
in order to detect single faults and to 
locate single link stuck faults for a class of 
multistage interconnection networks. This paper 
shows that only three tests are actually 
necessary and sufficient to detect and locate 
such faults. 


Fault Model 


The fault model described here applies to a 
class of multistage interconnection networks [2], 
although the discussion is mainly on the baseline 
network. The interconnection network discussed 
in this paper consists of $(N/2)\log_2 N$ switching 
elements, where N is the number of inputs, and 
$N(\log_2 N + 1)$ links. Each switching element has two 
inputs and two outputs, and it can have only two 
valid states as shown in Fig. 1. The faulty and 
the valid states constitute the 16 possible states 
of the switching elements listed in Table I. The 
faults to be diagnosed for a switching element in 
valid states S10 and S5 are listed in Tables II 
and III, respectively, where "-" means the logically 
undefined output and "φ" means the logically 
erroneous output resulting from the simultaneous 
input of 0 and 1. It is assumed that - and φ can 
be differentiated from each other and from 0 and 
1 during the test. The links of the network can 
have stuck-type faults (Tables II and III). 
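
As a quick check on the element and link counts, the following sketch assumes the usual baseline-network layout: log2 N stages of N/2 two-by-two switching elements, with one level of N links before the first stage, between each pair of stages, and after the last stage. The function name is ours.

```python
def baseline_network_size(num_inputs: int) -> tuple:
    """Return (switching elements, links) for an N-input baseline network."""
    stages = num_inputs.bit_length() - 1
    if num_inputs < 2 or (1 << stages) != num_inputs:
        raise ValueError("num_inputs must be a power of two, at least 2")
    elements = (num_inputs // 2) * stages   # (N/2) * log2(N)
    links = num_inputs * (stages + 1)       # N * (log2(N) + 1)
    return elements, links
```

For N = 16 this gives 32 switching elements and 80 links arranged in five link levels.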


Detection of Single Faults 


According to the fault model, a fault in an 
interconnection network can be either a link 
fault or a switching element fault. A link fault 
can be either a stuck-at-0 or stuck-at-1. A 
switching element fault can be considered to be 
the malfunction of the switching element from its 
valid states. 


Theorem 1: Three tests are necessary and 
sufficient for detecting single faults in a baseline 
network constructed of switching elements 
with two valid states S10 and S5. 


Proof: Consider first one switching element with 
inputs $x_1$, $x_2$ and its two outputs. To detect 
a single fault we need at least two tests, 
one for the switching element at state S10 and the 
other at S5. From Tables II and III it can be 
seen that the test $(x_1, x_2) = (0, 1)$ or $(1, 0)$ 
can detect all types of S10 and/or S5 malfunctions, 
but the test $(x_1, x_2) = (1, 1)$ or $(0, 0)$ 
cannot. However, any combination of the two 
tests [(0, 1) for both S10 and S5, (1, 0) for 
both S10 and S5, (0, 1) for S10 and (1, 0) for S5, 
or (1, 0) for S10 and (0, 1) for S5] is not sufficient 
to detect all the link faults. In other 
words, one additional test is needed during either 
the S10 or the S5 test. Therefore, at least three tests 
are required. Let the (0, 1) and (1, 0) tests be 
used for the switching element functioning at S10 so 
that any single link fault or S10 malfunction 
can be detected. Then, let the (0, 1) test be used 
for the switching element functioning at S5 to 
detect the S5 malfunction. Thus, three tests are 
necessary and sufficient to detect single faults 
for the network. 


Detection and Location of Link Faults 


There are two test phases as shown in Fig. 2. 
During Phase 1, the input terminals, labelled in 
binary numbers, with even or odd number of 1's 
receive input vector 01 or 10 (or alternately 10 
or 01), respectively. Based on the result of 
Phase 1 test, all input terminals then receive 
either all 1's or all 0's (Fig. 2.b or 2.c) during 
Phase 2 test. Fig. 3 shows an alternate test 
scheme. 


Theorem 2: Independent of network size, 
three tests are necessary and sufficient for detecting 
and locating single link faults in a 
baseline network constructed of switching elements 
with two valid states S10 and S5. 


Proof: The necessary condition is quite obvious 
because it requires at least two tests 
(Phase 1) in order to detect the link faults and 
at least one additional test to locate the fault. 
The sufficient condition follows from the 
fact that during the Phase 1 test the type of link 
stuck fault is determined and a unique faulty path 
can be computed between the faulty output and its 
input [1]. Thus, only one subsequent test is required 
to determine the other faulty path during 
Phase 2, so that the intersection of these two 
paths gives the faulty link. 


Table II. Faults, Test Inputs, and Outputs in Valid State S10 


Fig. 4 gives an example of the detection and 
location of link faults. Since the Phase 1 test 
identifies the link fault to be a stuck-at-0 type, 
every input terminal then receives a 1 during the 
Phase 2 test. From these two tests the possible 
faulty links are identified to be (6, 6, 3, 5, 6) 
for Phase 1 and (7, 6, 2, 0, 1) for Phase 2. Intersecting 
these two sets we find that the link 
stuck-at-0 fault is located at link 6 of level 1. 
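
The intersection step of the Fig. 4 example can be reproduced with a short sketch. It assumes that each tuple lists one candidate link per level and that levels are numbered from 0; the function name is ours.

```python
def intersect_paths(phase1_links, phase2_links):
    """Return (level, link) pairs where both faulty paths use the same link."""
    return [
        (level, link1)
        for level, (link1, link2) in enumerate(zip(phase1_links, phase2_links))
        if link1 == link2
    ]


# Link lists reported for the Fig. 4 example.
print(intersect_paths((6, 6, 3, 5, 6), (7, 6, 2, 0, 1)))  # [(1, 6)]
```

The single common entry is link 6 at level 1, which agrees with the fault location stated in the text.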
The test schemes described in this paper 
achieve the least number of tests required for 
detecting single faults and locating single link 
stuck faults for a class of multistage interconnection 
networks. It is obvious that additional 
tests are required in order to determine the type 
and location of switching element faults. 


Fault Tolerance Analysis of 
Several Interconnection Networks 


John Paul Shen 


Department of Electrical Engineering 
Carnegie-Mellon University 
Schenley Park, Pittsburgh, PA 15213 


ABSTRACT -- A B-network is an interconnection network 
composed of 2 x 2 switching elements called B-elements. 
B-networks can be used as multicomputer communication 
networks. In a previous paper, a theoretical framework 
facilitating the fault-tolerance analysis of B-networks was 
developed. In this paper, the analytical results from the earlier 
work are applied to the analysis of several well-known 
B-networks. These B-networks include the shuffle-exchange 
network, the double-tree network, the indirect binary n-cube 
network, and the Benes rearrangeable switching network. A 
formal technique for describing the topological structure of a 
B-network, and some useful techniques for analyzing complex 
B-networks are also presented. 


1. INTRODUCTION 


A class of interconnection networks called B-networks has 
been proposed as intercomputer communication networks (ICN) 
for multicomputer systems [1]. An n x n B-network is an 
interconnection network which provides connections from n 
input terminals to n output terminals and is composed of 2 x 2 
switching elements called B-elements. Each B-element can be 
set to one of two states, namely the "through" (T) state or the 
"cross" (X) state, corresponding to the two possible 
permutations of its input terminals. The n computing units of a 
multicomputer correspond to both the n input terminals and the 
n output terminals. Hence, the n input links and the n output 
links of the B-network are considered to be identical and have 
been defined as the n terminal links of the B-network [1]. 


In a previous paper [2] a theoretical framework facilitating the 
fault-tolerance analysis of B-networks was developed. This 
paper constitutes a sequel to that work. The analytical results 
from the earlier work are applied to the analysis of several 
well-known B-networks. These B-networks include the inverse 
shuffle-exchange network [3], the double-tree network [4], the 
indirect binary n-cube network [5], and the rearrangeable 
switching network [6]. 


Pertinent results from [2] are now summarized. A fault 
model was specified which allows B-elements to be stuck in 
either of their two normal states, i.e., stuck-at-through (s-a-T) or 
stuck-at-cross (s-a-X). A new connectivity property called 
dynamic full access (DFA) was introduced which serves as the 
criterion for fault tolerance in B-networks. A B-network has the 
DFA property if each of its inputs can be connected to any one of 
its outputs via a finite number of passes through the B-network. 
A fault in a B-network is a collection of B-element stuck-at faults. 
A fault is said to be critical if it destroys the DFA property of the 
B-network. A minimal critical fault is a critical fault none of 
whose proper subsets constitutes a critical fault. A B-network 
with DFA is k-fault tolerant or k-FT if the failure, either s-a-T or 
s-a-X, of any k or fewer B-elements does not destroy DFA. The 
largest k for which a B-network is k-FT is called the 
fault-tolerance (FT) parameter of the B-network. 


A graph model for analyzing B-networks called a B-graph was 
introduced in [2]. The labeled B-graph of a B-network is a 
labeled directed graph with vertices representing the 
B-elements, and edges representing the links of the B-network. 
An edge is labeled and called a terminal edge if it corresponds to 
a terminal link of the B-network, otherwise it is not labeled and is 
called an intermediate edge. An unlabeled B-graph, or simply a 
B-graph, is a labeled B-graph with all its edge labels deleted. 
Figure 1 illustrates the labeled B-graph of a B-network called the 
8 x 8 inverse shuffle exchange (ISE) network [3] which connects 
eight computing units {0,1,...,7}. Each computing unit is 
implicitly represented by a terminal edge in the B-graph. Usually 
the terminal edges are labeled with the indices of the associated 
computing units as depicted in Fig. 1. Each B-element in a 
B-network is modeled by a vertex with two incoming and two 
outgoing edges in the corresponding B-graph. A B-element 
stuck-at fault can be modeled by the splitting of the 
corresponding vertex into two subvertices, each with one 
incoming and one outgoing edge. Furthermore, it is easily seen 
that a B-network has the DFA property if and only if the 
corresponding B-graph is strongly connected. 


Fig. 1. (a) The 8 x 8 inverse shuffle exchange 
(ISE) network; (b) Its labeled B-graph. 


Given an n x n B-network N, a connection of N is a one-to-one 
mapping from the n input to the n output terminals, and can be 
represented by a permutation of n elements. The state of N is 
determined by the states of its z B-elements. If $s_i = s(b_i) \in \{T, X\}$ 
denotes the state of the B-element $b_i$, then a state of N is 
represented by a z-tuple $s(B) = s(b_1, b_2, \ldots, b_z) = (s_1, s_2, \ldots, s_z)$. If 
some of the B-elements are not specified, or their states are not 
of interest, then we can characterize s(B) by the partial state 
$s(B) = (s_1, s_2, \ldots, s_z)$, where $s_i \in \{T, X, d\}$ and d represents an 
unspecified or don't care state. A connection p of N is realizable 
if there exists a state s of N such that by setting N to state s, the 
one-to-one mapping specified by p is established. It is possible 
that a connection of N can be realized by more than one state of 
N. 


A second network parameter based on the intercomputer 
communication delays was introduced in [7] as a measure of the 
performance of a B-network. This parameter d is obtained by 
considering the communication delays between all pairs of 
computing units and choosing the maximum or worst case value 
of these delays. The above definition was formalized in [7] by 
making use of the B-graph model of B-networks. The 
edge-distance, or simply distance, from edge i to edge j in a 
B-graph is the number of intermediate vertices in the shortest 
directed path having edges i and j as its first and last edges, 
respectively. The edge-diameter, or simply diameter, of a 
B-graph is the longest distance between any two edges of the 
B-graph. The communication delay (CD) parameter d of a 
B-network is the diameter of its B-graph. 


The CD parameter d thus indicates the worst possible delay, 
measured in terms of the number of B-elements, between any 
pair of computing units in the multicomputer system. Meanwhile, 
the FT parameter k indicates the maximum number of 
B-elements in a B-network, whose failures, either s-a-T or s-a-X, 
do not destroy the DFA property of the network. In this paper, 
the FT and CD parameters are derived for several well-known 
B-networks. Section 2 presents a formal technique for 
describing the topological structure of a B-network, and some 
useful techniques for the analysis of complex B-networks. 
Sections 3 and 4 contain the analysis of the inverse shuffle-exchange 
(ISE) network and the double-tree (DOT) network, 
respectively. It is shown that both the ISE and the DOT networks 
are non-fault tolerant, i.e., their FT parameters are k = 0. 
However, modified versions of the ISE and DOT networks are 
presented which are fault tolerant. It is also shown that the 
shuffle-exchange (SE) network possesses the same FT and CD 
parameters as the ISE network. Section 5 analyzes the 
indirect binary n-cube (nIBC) network. It is shown that the nIBC 
network exhibits very desirable FT and CD parameters. It is 
further shown that the flip network used in the Staran SIMD 
parallel processor [8] and the omega network [9] also possess 
the same FT and CD parameters as the nIBC network. In Sec. 6 
the CD parameter and bounds for the FT parameter of Benes' 
rearrangeable switching (BRS) network [6] are presented. A 
conjecture on the actual FT parameter is also included. 


2. GENERAL ANALYSIS TECHNIQUES 


A B-network is a collection of B-elements interconnected by 
fixed links. A B-element can be viewed as containing flexible 
links which can be programmed, i.e., set to certain states, to 
provide desired communication paths from the inputs to the 
outputs of the B-element. This section develops a formal 
technique for concisely describing the topological structure of 
B-networks. Large and complex B-networks are typically 
constructed from smaller B-networks, the smallest being the 2 x 
2 B-element. Frequently, many identical subnetworks are 
connected to form a large network with a regular 
interconnection structure. Two very general interconnection 
methods for B-networks are now discussed. 


The dimension of an n x n B-network N is |N| = n. By 
numbering the inputs and outputs from top to bottom, the sets of 
inputs and outputs of N can be denoted by two ordered sets 
$I(N) = (I_1, I_2, \ldots, I_n)$ and $O(N) = (O_1, O_2, \ldots, O_n)$, 
respectively. In an interconnected 
multicomputer system the inputs and outputs of N coincide, 
hence we can say that I(N) = O(N), which means that $I_i$ is 
connected to $O_i$ for $i = 1, 2, \ldots, n$. B-networks with the same 
dimension can be connected to form a cascade or series 
network; we now define this concept formally. 


A B-network N is a cascade of B-networks $N_1, N_2, \ldots, N_g$, 
denoted $N = N_1 * N_2 * \cdots * N_g$, if $|N| = |N_1| = |N_2| = \cdots = |N_g|$, and 
$I(N) = I(N_1)$, $O(N_1) = I(N_2)$, ..., $O(N_{g-1}) = I(N_g)$, $O(N_g) = O(N)$. If all the 
subnetworks are identical, i.e., if $N_1 = N_2 = \cdots = N_g$, then we will 
write $N = N_1^g$. A cascaded B-network and all its subnetworks 
must have the same dimension. Many well-known B-networks 
are cascades of other networks. 


Given two ordered sets of terminals, $X = (X_1, X_2, \ldots, X_a)$ and 
$Y = (Y_1, Y_2, \ldots, Y_b)$, the union of X and Y, denoted $X \cup Y$, is another 
ordered set of terminals $Z = (Z_1, Z_2, \ldots, Z_c)$, such that c = a + b, 
and $Z_i = X_i$ for $i = 1, 2, \ldots, a$, and $Z_i = Y_{i-a}$ for $i = a+1, a+2, \ldots, a+b$. 
We can now define another interconnection method 
involving the vertical composition or juxtaposition of networks. 


A B-network N is a stack of B-networks $N_1, N_2, \ldots, N_g$, denoted 
$N = N_1 + N_2 + \cdots + N_g$, if $|N| = |N_1| + |N_2| + \cdots + |N_g|$ and 
$I(N) = I(N_1) \cup I(N_2) \cup \cdots \cup I(N_g)$ and 
$O(N) = O(N_1) \cup O(N_2) \cup \cdots \cup O(N_g)$. If all the subnetworks are 
identical, i.e., if $N_1 = N_2 = \cdots = N_g$, then we will write $N = gN_1$. 
The above two 
interconnection methods, cascade and stack, can be combined 
in the construction of complex B-networks. 


The interconnection topology of the n links in a B-network can 
be conveniently described by a permutation of its terminals. An 
n x n permuter $\pi$ is defined here as a network consisting of n 
fixed links connecting two sets of n terminals. The connections 
realized by the permuter $\pi$ can be represented by a permutation 
of n elements thus 


$$\pi = \begin{pmatrix} t_1 & t_2 & \cdots & t_n \\ \pi(t_1) & \pi(t_2) & \cdots & \pi(t_n) \end{pmatrix}$$ 


where $t_i$ is connected to $\pi(t_i)$ for $i = 1, 2, \ldots, n$. An n x n permuter 
can be considered to be a degenerate or empty n x n B-network, 
and hence can be used as a subnetwork in the construction of 
large networks. 
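
For permuters, the cascade and stack operators defined above reduce to simple manipulations of permutations. The sketch below is only an illustration: it assumes a permuter is stored as a tuple `p` with terminals numbered from 0 and `p[t]` giving the terminal that `t` is connected to, and the function names are ours.

```python
def cascade(first, second):
    """Cascade (*) of two permuters: pass through `first`, then `second`."""
    if len(first) != len(second):
        raise ValueError("cascaded networks must have the same dimension")
    return tuple(second[first[t]] for t in range(len(first)))


def stack(upper, lower):
    """Stack (+) of two permuters: `lower` is placed beneath `upper`."""
    offset = len(upper)
    return tuple(upper) + tuple(offset + t for t in lower)


def inverse(permuter):
    """Permuter with the direction of every link reversed."""
    result = [0] * len(permuter)
    for terminal, image in enumerate(permuter):
        result[image] = terminal
    return tuple(result)


swap = (1, 0)
print(stack(swap, swap))                      # (1, 0, 3, 2)
print(cascade((2, 0, 3, 1), inverse((2, 0, 3, 1))))  # (0, 1, 2, 3)
```

The dimension of a stack is the sum of the dimensions of its parts, while a cascade keeps the common dimension of its parts.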


Since a B-network is composed of B-elements and fixed links, 
B-elements and permuters can be considered as the most 
primitive elements used in the construction of B-networks. The 
two interconnection methods, cascade and stack, can be 
defined as operators on the primitive elements. All B-networks 
of interest can be formed by applying the cascade (*) and stack 
(+) interconnections to a set of B-elements and permuters. 


Two B-networks are isomorphic if they have the same labeled 
B-graphs. The actual symbols used for the labeling are 
insignificant. There exist one-to-one correspondences between 
the B-elements, links and terminal links of two isomorphic 
B-networks. Isomorphic B-networks also have the same 
connecting capability and network structure. However, the 
diagrams representing two isomorphic B-networks may not look 
identical. They can be made to look identical by rearranging the 
positions of the B-elements without breaking and reconnecting 
any link. 


Every B-network has a unique (unlabeled) B-graph, but a 
B-graph can represent more than one B-network, depending on 
the labeling of its terminal edges. Hence an unlabeled B-graph 
represents a class of B-networks all having the same unlabeled 
B-graph. We define two B-networks to be BG-equivalent 
(B-graph equivalent) if they have the same unlabeled B-graph. 


All the B-networks belonging to the same BG-equivalence 
class can be viewed as possible realizations of the same 
unlabeled B-graph. Each B-network corresponds to a specific 
labeling of the edges of the B-graph. In a B-graph of z vertices 
and 2z edges, there are $2^{2z}$ distinct ways of labeling its edges. 
Hence the B-graph represents a BG-equivalence class of $2^{2z}$ 
distinct B-networks, not all of which may have practical 
significance. All the edges in the B-graph of a single-stage 
B-network are terminal edges. Hence each B-graph represents a 
unique single-stage B-network. 


We are interested in the fault tolerance characteristics of 
B-networks. These characteristics depend strictly on the 
structure of the B-networks, and not on the computing units. It 
appears that a B-graph captures all the useful structural 
properties of a B-network, including connecting and switching 
properties needed for fault-tolerance analysis. Thus B-networks 
in the same BG-equivalence class have the same fault-tolerance 
properties. 


Let N be a B-network with z B-elements $B = \{b_1, b_2, \ldots, b_z\}$. A 
(partial) state of N, denoted $s(B) = (s_1, s_2, \ldots, s_z)$, is an assignment 
of each of the z B-elements to the T, X, or d state, where 
$s_i \in \{T, X, d\}$ denotes the state of the B-element $b_i$ for $i = 1, 2, \ldots, z$. 
The B-elements which have been assigned the T or X states are 
called the specified B-elements. The residual network of a 
B-network N with respect to a (partial) state s, denoted N/s, is 
the B-network obtained from N by replacing all the specified 
B-elements of s by fixed links according to the specified states. 
The number of B-elements in N/s is equal to the number of 
unspecified B-elements in s. The residual network of N with 
respect to a completely specified state is simply a collection of 
links and is not of much interest. Figure 2a depicts a B-network 
N with seven B-elements $B = \{b_1, b_2, \ldots, b_7\}$ connecting eight 
computing units. The residual B-network of N with respect to the 
partial state s(B) = (d,d,d,d,X,X,T), denoted M = N/s, is 
illustrated in Fig. 2b. We can extend this concept to B-graphs. If 
G is the B-graph of a B-network N, then the residual B-graph of 
G with respect to a state s, denoted G/s, is the B-graph of the 
residual B-network N/s. Figure 2c and Fig. 2d are the B-graphs 
of the B-network of Fig. 2a and its residual network M = N/s of 
Fig. 2b, respectively. It can easily be seen that residual networks 
of a B-network are also legitimate B-networks. 


Fig. 2. Illustration of a residual network: (a) A 
B-network N; (b) A single-stage B-network, 
M = N/s, s = (d,d,d,d,X,X,T); (c) B-graph of 
Fig. 2a; (d) B-graph of Fig. 2b. 


Many practical complex B-networks are constructed by 
systematically interconnecting a collection of smaller 
B-networks. By knowing the properties of the subnetworks and 
their interconnections we often can draw conclusions about the 
entire network. This is the approach taken here. 


Theorem 1: If two B-networks $N_1$ and $N_2$ are BG-equivalent 
then $N_1$ has DFA if and only if $N_2$ has DFA. 


Proof: Let $G_1$ ($G_2$) be the unlabeled B-graph of $N_1$ ($N_2$). A 
B-network has DFA if and only if its B-graph is strongly 
connected. Hence $N_1$ ($N_2$) has DFA if and only if $G_1$ ($G_2$) is 
strongly connected. Since $N_1$ and $N_2$ are BG-equivalent, $G_1$ 
must be identical to $G_2$, and hence $N_1$ has DFA if and only if $N_2$ 
has DFA. □ 


This theorem tells us that DFA is independent of the specific 
labeling of terminal links, i.e., all links can be considered as 
intermediate or terminal links. The DFA property of a B-network 
can be checked by inspecting any other B-network in its 
BG-equivalence class. 


Theorem 2: If $N_1$ is a residual network of N with respect to a 
state s, i.e., $N_1 = N/s$, and $N_1$ has DFA, then N must also have 
DFA. 


Proof: All the terminal links of N still exist in $N_1$. Since $N_1$ has 
DFA, there exists a connecting path between any pair of terminal 
links of $N_1$. Since $N_1$ is a residual network of N, all connecting 
paths in $N_1$ exist in N. Hence there exists a connecting path 
between any pair of terminal links of N. Therefore N must have 
DFA. □ 


Frequently, a residual network of a B-network has an obvious 
structure which facilitates fault-tolerance analysis. Theorem 2 
allows us to analyze the residual network and draw certain 
conclusions about the original network. Clearly, the converse of 
Theorem 2 is not true. Every faulty B-network is a residual 
network of the fault-free B-network. A critical fault produces a 
residual B-network which does not have DFA. A B-network has 
the full-access property if each input terminal can be connected 
to every output terminal via exactly one pass through the 
network [6]. The proof of the following theorem is 
straightforward and is omitted. 


Theorem 3: Let the B-network N be a cascade of g subnetworks 
$N_1, N_2, \ldots, N_g$, i.e., $N = N_1 * N_2 * \cdots * N_g$. The network N has full 
access if any one of the subnetworks $N_1, N_2, \ldots, N_g$ has full access. □ 


Clearly any B-network N having full access must have DFA. 
The above theorem says that if N is a cascade of subnetworks 
then any one of the subnetworks having full access will 
guarantee full access and DFA for N. We use I to denote the 
identity connection 


$$I = \begin{pmatrix} t_1 & t_2 & \cdots & t_n \\ t_1 & t_2 & \cdots & t_n \end{pmatrix}$$ 


in which every terminal $t_i$ is connected to itself via the network 
N. We say a network N contains the identity connection I if there 
exists a complete state s of N that realizes the identity 
connection. I will also be used to represent the residual network 
N/s. 


Theorem 4: Let $N = N_1 * N_2 * \cdots * N_g$. If some $N_j$ has DFA and all 
the other $N_i$'s contain the identity connection I, then N must have 
DFA. 


Proof: Assume that $N_j$ has DFA. Let $N' = N_1 * N_2 * \cdots * N_{j-1}$ and 
$N'' = N_{j+1} * \cdots * N_g$, so that $N = N' * N_j * N''$. Since each $N_i$ for 
$i \neq j$ contains I, both N' and N'' must also contain I. Let s' and 
s'' be complete states of N' and N'', respectively, such that N'/s' 
= I and N''/s'' = I. Let s be a partial state of N that results from 
setting N' and N'' to the states s' and s'', respectively. Clearly 
$N/s = I * N_j * I = N_j$. Since $N_j$ has DFA, N/s must have DFA. 
According to Theorem 2, N must also have DFA. □ 


The above theorem implies that if a B-network N has DFA and 
contains the identity connection I, then any cascade of multiple 
copies of N must also have DFA. The foregoing results will be 
applied to the analysis of several well-known B-networks in 
subsequent sections. For convenience we assume in this paper 
that all B-networks have dimension $2^m$ where m is an integer, i.e., 
only $2^m \times 2^m$ B-networks are considered. 


3. INVERSE SHUFFLE-EXCHANGE NETWORKS 


An n x n B-stack S is an n x n B-network consisting of a stack 
of n/2 B-elements. Many n x n B-networks contain a single 
stage or multiple stages of B-stacks. A well-known B-network 
called the n x n shuffle-exchange network or SE network, P, is 
the cascade of an n x n permuter and an n x n B-stack [3]. The n 
x n permuter $\sigma$ used here, which resembles the perfect shuffling 
of a deck of cards, is called the perfect shuffle. If the terminals 
are numbered from 0 to n-1, then the perfect shuffle permutation 


$$\sigma = \begin{pmatrix} 0 & 1 & \cdots & n-1 \\ \sigma(0) & \sigma(1) & \cdots & \sigma(n-1) \end{pmatrix}$$ 


can be defined as follows: 

$$\sigma(i) = (2i + \lfloor 2i/n \rfloor) \bmod n, \quad \text{for } i = 0, 1, \ldots, n-1.$$ 
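
The perfect shuffle formula is easy to tabulate. The following sketch (the function name is ours) evaluates $(2i + \lfloor 2i/n \rfloor) \bmod n$ for every terminal and, for n a power of two, can be compared against a one-bit left rotation of the terminal index.

```python
def perfect_shuffle(n: int) -> list:
    """Return the list [sigma(0), ..., sigma(n-1)] for an n x n perfect shuffle."""
    return [(2 * i + (2 * i) // n) % n for i in range(n)]


print(perfect_shuffle(8))  # [0, 2, 4, 6, 1, 3, 5, 7]
```

For n = 8 the first half of the terminals lands on the even positions and the second half on the odd positions, as when two halves of a deck are interleaved.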


The inverse of an n x n B-network N, denoted $N^{-1}$, is another n 
x n B-network that is the same as N except that the direction of 
all the links is reversed. The input terminals become the output 
terminals, and vice versa. Clearly the inverse of an n x n 
permuter is represented by the corresponding inverse 
permutation. The n x n inverse shuffle-exchange network, or ISE 
network, is the inverse of the n x n SE network. Figure 1a 
depicts the 8 x 8 ISE network. Although $N * N = N^2$, $N * N^{-1}$ is not 
always defined, unless N is a permuter. In fact if both $N_1$ and $N_2$ 
are permuters, then the cascade operator * used in $N_1 * N_2$ 
becomes identical to the usual composition operator in 
permutation groups. It can be shown that $(N^{-1})^k = (N^k)^{-1} = N^{-k}$. 


For convenience, we restrict our attention to $2^m \times 2^m$ ISE 
networks, where m is an integer. Each B-element in an ISE 
network can therefore be designated by an m-bit binary number 
$b_m b_{m-1} \ldots b_1$, where $b_i \in \{0, 1\}$. The top B-element is designated 
00...0 and the bottom B-element is designated 11...1. Following 
the same convention, all the 2n links can be labeled from top to 
bottom by (m+1)-bit binary numbers $b_{m+1} b_m \ldots b_1$, starting from 
00...0 and terminating with 11...1, as illustrated in Fig. 1a. 


In the above labeling scheme each vertex $b_m b_{m-1} \ldots b_1$ has two 
incoming links labeled $b_m \ldots b_1 0$ and $b_m \ldots b_1 1$, and two outgoing 
links labeled $0 b_m \ldots b_1$ and $1 b_m \ldots b_1$. When B-element $b_m b_{m-1} \ldots b_1$ 
is in the T-state, connections are established from $b_m \ldots b_1 0$ to 
$0 b_m \ldots b_1$ and from $b_m \ldots b_1 1$ to $1 b_m \ldots b_1$. If it is in the X-state, these 
connections are reversed. The labels for B-elements can be 
translated directly into B-graphs to identify corresponding 
vertices. The binary (m+1)-tuple labels for B-network links can 
be used to label edges in the B-graph and thereby implicitly 
identify the computing units. The labeled B-graph of the ISE 
network of Fig. 1a is shown in Fig. 1b. 


Fig. 3. The 8 x 8 omega network. 


It has been shown that Pease's indirect binary m-cube is 
isomorphic to the omega network, which is actually a cascade of 
m stages of the $2^m \times 2^m$ ISE network [10]. For example, the 
$2^3 \times 2^3$ omega network in Fig. 3 is a cascade of three identical 8 x 8 
ISE networks. We know from Pease's work [5] that the indirect 
binary m-cube has the full access property, that is, every input 
terminal of the network can reach any output terminal via one 
pass through the network. By doing a space-to-time 
transformation, the i-th stage of the indirect binary m-cube can be 
mapped onto the i-th pass through the $2^m \times 2^m$ ISE network. 
Hence if an input terminal of an indirect binary m-cube can reach 
any one of the output terminals in m stages, then any input 
terminal of the $2^m \times 2^m$ ISE network should be able to reach any 
other terminal within the distance m. The communication-delay 
parameter d of the $2^m \times 2^m$ ISE network must therefore be m or 
less. In other words, for the $2^m \times 2^m$ ISE network, $d \le m$. It has 
been shown in [7] that, given a B-network of n B-elements, the 
lower bound for its CD parameter d is $\lceil \log_2 n \rceil + 1$. Since the 
$2^m \times 2^m$ ISE network has $2^{m-1}$ B-elements, the lower bound for its 
d must be $\lceil \log_2 (2^{m-1}) \rceil + 1 = m$. Hence the CD parameter of 
the $2^m \times 2^m$ ISE network must be d = m. It is easy to see that the 
$2^m \times 2^m$ ISE network is 0-FT. Both the top and bottom 
B-elements contain selfloops which constitute single critical 
faults [1]. The foregoing discussion leads to the following 
theorem. 


Theorem 5: The FT and CD parameters of the $2^m \times 2^m$ ISE
network are k = 0 and d = m, respectively.


The minimal communication delay of ISE networks makes
them very desirable for systems requiring very fast
communication. In addition, ISE networks require very simple
control algorithms [5]. Clearly, a serious drawback of ISE 
networks is their lack of fault tolerance. We now propose a 
modified ISE network which is fault tolerant and still possesses 
the minimal communication delay. 


The $2^m \times 2^m$ modified ISE network, or MISE network, is the
$2^m \times 2^m$ ISE network with two of its links altered as follows. The top
output from B-element 00...0 is connected to link 11...1 instead
of to link 00...0. Similarly, the bottom output of B-element 11...1
is connected to link 00...0 instead of to link 11...1. Basically, in
the MISE network, the destinations of the two original self-loop
links are exchanged. Figure 4 illustrates the $2^3 \times 2^3$ MISE
network. It can be shown that the $2^m \times 2^m$ MISE network is 1-FT
and still possesses the same minimal CD parameter of d = m.
This result is stated in the following theorem, whose proof is
documented in [7].


Theorem 6: The FT and CD parameters of the $2^m \times 2^m$ MISE
network are k = 1 and d = m, respectively.

Fig. 4. (a) The 8 x 8 MISE network;
(b) Its B-graph.
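The effect of exchanging the two self-loop destinations can be checked with a small sketch. The helper names `element_successors` and `self_loops` are hypothetical, m is assumed to be the number of bits in a B-element label, and the sketch only looks at which B-element each output link feeds (a link labeled $b_{m+1} \ldots b_1$ enters the element named by its upper m bits). It says nothing about the 1-FT proof itself, which is in [7].

```python
def element_successors(m: int, modified: bool = False) -> dict[int, set[int]]:
    """Map each B-element to the set of B-elements fed by its two outputs.

    modified=False models the ISE wiring, modified=True the MISE wiring.
    (Hypothetical helper for illustration only.)
    """
    top, bottom = 0, (1 << (m + 1)) - 1        # link labels 00...0 and 11...1
    succ = {}
    for e in range(1 << m):
        outs = [e, (1 << m) | e]               # outgoing links 0b...b and 1b...b
        if modified:
            # MISE: exchange the destinations of the two self-loop links
            outs = [bottom if link == top else top if link == bottom else link
                    for link in outs]
        succ[e] = {link >> 1 for link in outs}  # upper m bits name the target
    return succ


def self_loops(succ: dict[int, set[int]]) -> list[int]:
    """Return the B-elements that feed one of their own inputs."""
    return sorted(e for e, targets in succ.items() if e in targets)


if __name__ == "__main__":
    print("ISE self-loops: ", self_loops(element_successors(2)))
    print("MISE self-loops:", self_loops(element_successors(2, modified=True)))
```

In this model only the top and bottom B-elements of the ISE wiring feed themselves, and the MISE rewiring removes both self-loops without adding any B-element.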


The MISE networks are fault-tolerant B-networks with minimal
communication delay. We have thus synthesized a fault-tolerant
B-network by modifying a non-fault-tolerant B-network. This was
accomplished without adding extra B-elements or increasing
communication delay. The simple control algorithm used for ISE
networks needs to be modified only very slightly for the MISE
networks [1]. The B-graph of the $2^m \times 2^m$ SE network is
isomorphic to that of the $2^m \times 2^m$ ISE network. Hence they both
possess the same FT and CD parameters. The same is true for the
modified SE network and the MISE network.


4. DOUBLE-TREE NETWORKS 


The double-tree network was first proposed by Levitt, Green
and Goldberg in their study of a class of B-networks called CPCU
(complete permutation-complete utilization) networks [4]. A
double-tree network consists of a right and a left "half." Each
half of the network resembles a binary tree. The left and right
trees are mirror images of each other, and each pair of
mirror-image B-elements is connected by a link. Figure 5 illustrates
the $2^3 \times 2^3$ double-tree network. The double-tree network has been
investigated by three different research teams for three very
different applications. Levitt et al. designed the double-tree
network as a fault correcting network. By cascading a
double-tree network with a CPCU network, a single-fault correcting
CPCU network is obtained. Hopper and Wheeler [11]
considered using the double-tree network as a packet switching
network for local computer networks. The double-tree network
was one of two B-networks considered by Leung and Dennis
[12] for implementing the distribution network of the MIT Data
Flow Processor.


Fig. 5. The $2^3 \times 2^3$ double-tree network,
or DOT network, $D_3$. (Figure labels: Left Half, Right Half.)


The structure of a $2^m \times 2^m$ double-tree network, or DOT
network, denoted $D_m$, can be defined recursively as follows. The
B-element is defined as the $2^1 \times 2^1$ DOT network. The $2^m \times 2^m$
DOT network $D_m$ is obtained by cascading a stack of $2^{m-1}$
B-elements to the input side and a stack of $2^{m-1}$ B-elements to
the output side of the $2^{m-1} \times 2^{m-1}$ DOT network $D_{m-1}$, according to
the following construction rules; see Fig. 6.

1. Assign the labels $b_1, b_2, \ldots, b_{2^{m-1}}$ ($b'_1, b'_2, \ldots, b'_{2^{m-1}}$) to
the B-elements to be cascaded to the input (output)
side of $D_{m-1}$.

2. Connect one output of the new B-element $b_j$ to input
link $I_j$ of $D_{m-1}$, for $j = 1, 2, \ldots, 2^{m-1}$.

3. Connect one input of the new B-element $b'_j$ to output
line $O_j$ of $D_{m-1}$, for $j = 1, 2, \ldots, 2^{m-1}$.

4. Connect the remaining output of $b_j$ to the remaining
input of $b'_j$.

The inputs of the $b_j$'s and the outputs of the $b'_j$'s are the inputs
and outputs, respectively, of $D_m$.

Fig. 6. The general structure of the $2^m \times 2^m$
DOT network $D_m$.
In general, $D_m$ has $2^{m+1} - 3$ B-elements, and has 2m-1 stages
of B-elements, with stage i and stage 2m-i each having exactly
$2^{m-i}$ B-elements, for i = 1,...,m. For example, $D_3$ as shown in Fig.
5 has 5 stages and stages 1, 2, 3, 4 and 5 have 4, 2, 1, 2 and 4
B-elements respectively. The single B-element in the middle, i.e.,
stage m of $D_m$, is called the center B-element, and is denoted $b_c$.
Based on the construction of $D_m$, a vertical symmetry and a
horizontal symmetry can be identified in the network structure of
$D_m$. The left half and the right half of $D_m$ are symmetrical with
respect to a vertical axis passing through $b_c$. Furthermore, the
upper half and the lower half of $D_m$ are symmetrical with respect
to a horizontal axis passing through $b_c$.
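The stage counts quoted above follow directly from the recursive construction, which can be sketched as below. The function name `dot_stage_sizes` is hypothetical and the sketch tracks only how many B-elements each stage holds, not the wiring.

```python
def dot_stage_sizes(m: int) -> list[int]:
    """Number of B-elements in each stage of the DOT network D_m.

    D_1 is a single B-element; D_m wraps D_(m-1) with a stack of
    2^(m-1) B-elements on the input side and another on the output side.
    (Hypothetical helper for illustration only.)
    """
    if m == 1:
        return [1]
    stack = 2 ** (m - 1)
    return [stack] + dot_stage_sizes(m - 1) + [stack]


if __name__ == "__main__":
    for m in range(1, 6):
        sizes = dot_stage_sizes(m)
        print(m, sizes, "total:", sum(sizes))
```

For each m the list has 2m-1 entries and sums to $2^{m+1} - 3$, in agreement with the counts stated in the text.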


A DOT network is a multiple-stage B-network. Unlike a single- 
stage network, not every link of a multiple-stage network is a 
terminal link. Hence, not every edge in its B-graph is a terminal 
edge. The CD parameter d denotes the longest distance 
separating any pair of edges in the B-graph. Since the terminal 
edges represent computing units, for a multiple-stage network, it 
is useful to also consider the longest distance separating any 
pair of terminal edges in the B-graph. Hence, we define the 
terminal delay (TD) parameter t of a B-network as the longest 
distance between any pair of terminal edges in its B-graph. 
Consequently the TD parameter indicates the actual worst case 
communication delay between any two computing units. For 
single-stage networks t = d. Clearly, the terminal delay of a
full-access B-network with t stages is equal to t.


Lemma 1: The $2^m \times 2^m$ DOT network $D_m$ has TD parameter t =
2m-1 and CD parameter d = 4m-3.

Proof: Every input terminal of $D_m$ can reach the center
B-element $b_c$, and $b_c$ can reach every output terminal. Hence
$D_m$ has full access. Since $D_m$ has 2m-1 stages, the TD
parameter must be t = 2m-1.

Let the two input links of $b_c$ be $a_1$ and $a_2$, and the two output
links of $b_c$ be $e_1$ and $e_2$. The links $e_1$ and $e_2$ can reach every
terminal link of $D_m$ via exactly m-1 B-elements, and can reach
every link of $D_m$ via 2(m-1) or fewer B-elements. On the other
hand, every link in $D_m$ can reach $e_1$ and $e_2$ via 2m-1 or fewer
B-elements. Hence the distance between any two links is at
most 2(m-1) + 2m-1 = 4m-3.

To prove that d = 4m-3, we must show that there exist two
links in $D_m$ separated by the distance 4m-3. The distance from
$e_1$ back to $a_1$ is equal to 2m-2, and the distance from $a_1$ to $a_2$ is
equal to 2m-1. Since the upper and lower halves of $D_m$ are
joined only by the center B-element $b_c$, the shortest path from $e_1$
to $a_2$ must include $a_1$. Hence the distance from $e_1$ to $a_2$ is
2m-2 + 2m-1 = 4m-3. Therefore the CD parameter of $D_m$ is d =
4m-3.


It is easily seen that the center B-element $b_c$ of $D_m$ is critical. If
$b_c$ is s-a-T, then the network will be split into disjoint upper and
lower halves. Hence the FT parameter of $D_m$ is k = 0. To
summarize the foregoing results we have the following theorem.

Theorem 7: Let $D_m$ denote the $2^m \times 2^m$ DOT network. The FT
parameter of $D_m$ is k = 0. The CD parameter of $D_m$ is d = 4m-3.
The TD parameter of $D_m$ is t = 2m-1.


We now propose a modification of the DOT network to make it
fault tolerant. We know that the center B-element $b_c$ is critical; it
can easily be shown to be the only critical B-element. We can
remove the only single critical fault, $b_c$ s-a-T, by simply deleting
the center B-element $b_c$. The modified DOT network, or MDOT
network, $X_m$, is identical to the DOT network $D_m$ except that the
center B-element $b_c$ is permanently set to the X-state. Figure 7
illustrates the $2^3 \times 2^3$ MDOT network.

Fig. 7. The $2^3 \times 2^3$ modified DOT network,
or MDOT network, $X_3$.


Like the original DOT network, the MDOT network has full
access. However, the number of stages, and hence the TD
parameter, has been reduced by one to t = 2m-2. Using the
argument given in the proof of Lemma 1, it can be shown that the
CD parameter for the $2^m \times 2^m$ MDOT network is d = 4m-4, which
is also one less than the CD parameter of the DOT network.
Thus the MDOT network actually has better delay characteristics
than the DOT network. It can be shown that the smallest minimal
critical faults (MCFs) of the $2^m \times 2^m$ MDOT network consist of
pairs of mirror-image B-elements being stuck at the same state.
Hence, no single critical fault exists and the $2^m \times 2^m$ MDOT
network is 1-FT. We have the following theorem, a formal proof
of which can be found in [1].

Theorem 8: Let $X_m$ denote the $2^m \times 2^m$ MDOT network. The FT
parameter of $X_m$ is k = 1. The CD parameter of $X_m$ is d = 4m-4.
The TD parameter of $X_m$ is t = 2m-2.


Interestingly, we have succeeded in modifying a
non-fault-tolerant B-network to make it 1-FT, while decreasing its
communication delay. As might be expected, the connecting
capability of the MDOT network, in terms of the number of
permutations that can be realized, is slightly less than that of the
corresponding DOT network due to the absence of the center
B-element. An alternative modification which does not sacrifice
any connecting capability is simply to add a redundant
B-element in tandem with $b_c$ to correct the s-a-T fault of $b_c$.
5. INDIRECT BINARY m-CUBE NETWORKS


Various "cube" structures have been proposed for 
interconnecting large numbers of processors in computer 
systems. The binary m-cube structure has been frequently 
considered [5,13]. This network may be thought of as 
interconnecting 2™ processors which are placed at the vertices 
of an m-dimensional cube. Each edge of the cube represents a 
link connecting two processors, hence the name "binary" m- 
cube. One processor can be designated as the origin with an 
m-bit binary address 00...0. Other processors can then be 
identified by their corresponding coordinates in the
m-dimensional space. The binary m-cube has a regular structure
and is relatively simple to control. The average communication 
delay between any pair of processors is small. 


Fig. 8. The general structure of the mIBC
network $C_m$.


The indirect binary m-cube, or mIBC network, denoted by $C_m$,
is a $2^m \times 2^m$ B-network which is defined recursively as follows. A
B-element is a 1IBC network. An mIBC network for $m \ge 2$ is
constructed from two (m-1)IBC networks and a $2^m \times 2^m$ B-stack
according to the following equation

$$C_m = (C_{m-1} + C_{m-1}) * \sigma * S * \sigma^{-1},$$

where $\sigma$ is the perfect shuffle permuter, S is a B-stack, and $\sigma^{-1}$ is
the inverse perfect shuffle permuter. As before, the operator *
denotes cascade or composition of permutations. Figure 8
shows the general structure of $C_m$. The indirect binary 3-cube
$C_3$ is illustrated in Fig. 9.

Fig. 9. The 3IBC network $C_3$.


The interconnections provided by the mIBC network emulate
those of an m-dimensional binary cube [5]. The mIBC network
has m stages of B-elements with $2^{m-1}$ B-elements per stage. The
total number of B-elements in this network is $m 2^{m-1}$. There are
$2^m$ terminal links which correspond to the corners of the binary
m-cube, while the B-elements correspond to the edges of the
binary m-cube. The B-elements in any stage correspond to all
the edges parallel to one of the axes. A B-element set to the
X-state corresponds to a traversal of that edge in the binary
m-cube. A simple algorithm, similar to those of the ISE and MISE
networks, exists for the control of the mIBC networks [5].


Two other well-known B-networks are actually isomorphisms
of the mIBC network. The $2^m \times 2^m$ flip network used in the
Staran SIMD parallel processor manufactured by Goodyear
Aerospace is structurally isomorphic to the mIBC network [8].
The two networks differ only in their control schemes. Unlike the
mIBC network, in which each B-element can be individually
controlled, there is only one control line for each stage of
B-elements in the flip network. All the B-elements in the same
stage are set to either the T-state or the X-state simultaneously,
to accomplish either the "shift" or the "flip" operation. Lawrie
devised a B-network called the omega network for accessing
and aligning data in an array processor [9]. The omega network
is typically placed between a set of processors and a set of
memory modules. Processors access data in the memory via the
omega network. An inverse omega network is an omega
network in which the direction of all the links is reversed. It has
also been shown that the $2^m \times 2^m$ inverse omega network and
the indirect binary m-cube are structurally isomorphic [10].


It is well-known that the mIBC network is a full access
B-network, hence its terminal delay t is m, the number of stages.
In fact there exists a unique path of length m from each terminal
link to any other terminal link. Since a terminal link $a_i$ can reach
any other terminal link in distance m, $a_i$ must be able to reach
any link, terminal or intermediate, within the distance 2m-1. We
now show that the CD parameter d of the mIBC network is 2m-1.


Two BG-equivalent B-networks have the same communication
delay. A BG-equivalent of the mIBC network can be obtained by
cyclically rotating the stages, i.e., by replacing stage 1 by stage
m, stage 2 by stage 1, etc. Since the mIBC network is
isomorphic to the inverse omega network, which is a cascade of
identical stages, any cyclic rotation will produce an isomorphic
B-network. Consequently, the links in any stage can be made
the terminal links by the cyclic rotation. Hence any link whether
terminal or intermediate can reach any other link within the
distance 2m-1. In one pass an input terminal link can reach all
the output links of the m-th stage, but only half of the input links to
the m-th stage. The unreached links can be reached in a second
pass. Consequently, there exist links separated by the distance
2m-1, hence the TD parameter t of the $2^m \times 2^m$ mIBC network is
m and the CD parameter d is 2m-1.


The recursive structure of the mIBC network suggests that we 
can determine its fault-tolerance parameter by induction. For 
this purpose we first develop several lemmas. 


Lemma 2: Let $C_m$ be an mIBC network. By setting all the
B-elements in the m-th stage to the X-state we obtain a residual
network $C'_m$ having the structure illustrated in Fig. 10.

Fig. 10. A residual network $C'_m$ of the
mIBC network $C_m$.


Proof: The structure of $C_m$ is defined in Fig. 8. We need to show
the permuters $\sigma$ and $\sigma^{-1}$ and the B-elements in the m-th stage of
$C_m$ combine to form the permuter $\gamma$ of Fig. 10. If we label the
links in a stage from top to bottom with the address numbers
$0, 1, \ldots, 2^m - 1$, then the permuter $\gamma$ can be defined by the following
permutation:

$$\gamma(a) = a + 2^{m-1} \pmod{2^m}$$

where $a \in \{0, 1, \ldots, 2^m - 1\}$ is the address of a link. If we represent a
by the equivalent binary number $a_m a_{m-1} \ldots a_1$, then we can write

$$\gamma(a_m a_{m-1} \ldots a_1) = \bar{a}_m a_{m-1} \ldots a_1$$

where $\bar{a}_m = 0$ (1) if $a_m = 1$ (0). Effectively, the permuter $\gamma$
connects the i-th output of the upper $C_{m-1}$ to the i-th input of the
lower $C_{m-1}$, and vice versa. The perfect shuffle $\sigma$ and the inverse
perfect shuffle $\sigma^{-1}$ can be similarly defined as follows:

$$\sigma(a_m a_{m-1} \ldots a_1) = a_{m-1} \ldots a_1 a_m$$

and

$$\sigma^{-1}(a_m a_{m-1} \ldots a_1) = a_1 a_m \ldots a_2.$$

$\sigma$ effectively corresponds to a cyclic left shift of the binary
address $a_m a_{m-1} \ldots a_1$, and $\sigma^{-1}$ corresponds to a cyclic right shift of
this address.

The permuting effect of a B-element can also be defined in
this way. Let the permutation realized by a B-element in the
T-state be denoted by $E^T$ and that realized by a B-element in the
X-state be denoted by $E^X$. Then

$$E^T(a_m a_{m-1} \ldots a_1) = a_m a_{m-1} \ldots a_1$$

and

$$E^X(a_m a_{m-1} \ldots a_1) = a_m a_{m-1} \ldots \bar{a}_1.$$

Now

$$\sigma * E^X * \sigma^{-1}(a_m a_{m-1} \ldots a_1) = E^X * \sigma^{-1}(a_{m-1} \ldots a_1 a_m)$$
$$= \sigma^{-1}(a_{m-1} \ldots a_1 \bar{a}_m)$$
$$= \bar{a}_m a_{m-1} \ldots a_1$$
$$= \gamma(a_m a_{m-1} \ldots a_1)$$

from which it follows that $\sigma * E^X * \sigma^{-1} = \gamma$.


Corollary 1: The residual B-network $C'_m$ of Lemma 2 is
BG-equivalent to a B-network $C''_m$ obtained by cascading two
(m-1)IBC networks $C^1_{m-1}$ and $C^2_{m-1}$.


If, instead of setting all the B-elements in the m-th stage to the
X-state, we set them to the T-state, another interesting residual
network is obtained.

$$\sigma * E^T * \sigma^{-1}(a_m a_{m-1} \ldots a_1) = E^T * \sigma^{-1}(a_{m-1} \ldots a_1 a_m)$$
$$= \sigma^{-1}(a_{m-1} \ldots a_1 a_m)$$
$$= a_m a_{m-1} \ldots a_1$$

Hence $\sigma * E^T * \sigma^{-1} = I$, where I is the identity permuter. This leads
to the following lemma.


Lemma 3: Let $C_m$ be an mIBC network. By setting all the
B-elements in the m-th stage to the T-state we obtain a residual
network $C^*_m$ which is a stack of two (m-1)IBC networks $C^1_{m-1}$ and
$C^2_{m-1}$.
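The two compositions used in Lemma 2 and Lemma 3 can be checked numerically. The sketch below assumes link addresses are m-bit integers, models the perfect shuffle as a cyclic left shift, the inverse shuffle as a cyclic right shift, and an X-state B-element as complementing the lowest address bit, as in the definitions above. The function names are hypothetical.

```python
def shuffle(a: int, m: int) -> int:
    """Perfect shuffle: cyclic left shift of an m-bit address."""
    return ((a << 1) | (a >> (m - 1))) & ((1 << m) - 1)


def inverse_shuffle(a: int, m: int) -> int:
    """Inverse perfect shuffle: cyclic right shift of an m-bit address."""
    return (a >> 1) | ((a & 1) << (m - 1))


def exchange(a: int, crossed: bool) -> int:
    """B-element stage: X-state flips the lowest bit, T-state leaves it."""
    return a ^ 1 if crossed else a


def last_stage(a: int, m: int, crossed: bool) -> int:
    """Shuffle, then the B-stack in a uniform state, then inverse shuffle."""
    return inverse_shuffle(exchange(shuffle(a, m), crossed), m)


if __name__ == "__main__":
    m = 3
    for a in range(1 << m):
        print(f"{a:03b}: X -> {last_stage(a, m, True):03b}, "
              f"T -> {last_stage(a, m, False):03b}")
```

With all B-elements crossed, every address a is sent to a + 2^(m-1) mod 2^m, which is the permuter of Fig. 10; with all B-elements straight, the stage reduces to the identity.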


The two residual networks $C'_m$ and $C^*_m$ will be useful in
computing the fault tolerance of $C_m$. As shown in Corollary 1,
$C'_m$ is essentially a cascade of two (m-1)IBC networks, while
$C^*_m$ is a stack of two (m-1)IBC networks. Since the mIBC
network has loops of length m, and hence has elementary
circuits of length m in its B-graph, its fault tolerance k must be
less than m. We show next, by induction, that k is indeed m-1.


Lemma 4: The FT parameter k of the mIBC network $C_m$ is m-1,
for $m \ge 2$.


Proof: The 2IBC network is clearly 1-FT because its B-graph
contains a Hamiltonian circuit and no self-loop [1]. It remains to
be shown that for m > 2, if $C_{m-1}$ is (m-2)-FT, then $C_m$ must be
(m-1)-FT.

Let f be any set of m-1 faulty B-elements in $C_m$. Since $C_m$ has
m stages, there must exist a stage which does not contain any
faulty B-element. We can assume, without losing generality, that
this fault-free stage is the m-th stage. Hence all the faulty
B-elements of f are in the first m-1 stages, i.e., they are contained
in the two subnetworks $C^1_{m-1}$ and $C^2_{m-1}$ of $C_m$. Since all the
B-elements in the m-th stage are fault-free, they can all be set to
the X-state or the T-state to obtain the residual networks $C'_m$ or
$C^*_m$, respectively. We want to show that $C_m$ containing the fault f
still has DFA or, equivalently, that the residual network
$C_m/S_f$, where $S_f$ is the partial state representing the fault f, still
has DFA.


Case 1: (One subnetwork is fault-free) All m-1 faulty B-elements
of f are in one subnetwork, say $C^1_{m-1}$. The other subnetwork $C^2_{m-1}$
must then be fault-free. We know that a fault-free IBC network
has full access. Hence $C^2_{m-1}$ must have full access. Since $C'_m$ is
BG-equivalent to a cascade of the two subnetworks (Corollary
1), it can be concluded from Theorem 3 that $C'_m/S_f$ must still
have full access. Since $C'_m/S_f$ has DFA and is a residual
network of $C_m/S_f$, by Theorem 2, $C_m/S_f$ must also have DFA.
Therefore $C_m$ is (m-1)-FT.


Case 2: (Both subnetworks are faulty) The m-1 faulty B-elements
of f are distributed in both $C^1_{m-1}$ and $C^2_{m-1}$. In this case each
subnetwork can contain at most m-2 faulty B-elements. Since by
assumption $C^1_{m-1}$ and $C^2_{m-1}$ are (m-2)-FT, both $C^1_{m-1}/S_f$ and
$C^2_{m-1}/S_f$ must still have DFA. The upper and lower halves, $C^1_{m-1}$
and $C^2_{m-1}$, of the residual network $C^*_m$ are disjoint. Since both
$C^1_{m-1}/S_f$ and $C^2_{m-1}/S_f$ have DFA, a link in the upper (lower) half of
$C^*_m/S_f$ can reach all the links in the upper (lower) half of $C^*_m/S_f$.
For a link in the upper (lower) half to reach all the links in the
lower (upper) half, we can set the fault-free B-elements in the m-th
stage to the X-state to obtain $C'_m/S_f$. Consequently any link in
$C_m/S_f$ can reach any other link in $C_m/S_f$. Hence $C_m/S_f$ must
have DFA and $C_m$ must be (m-1)-FT.


Theorem 9: Let $C_m$ denote the $2^m \times 2^m$ mIBC network. The FT
parameter of $C_m$ is k = m-1. The CD parameter of $C_m$ is d =
2m-1. The TD parameter of $C_m$ is t = m.


Unlike many of the B-networks presented earlier, mIBC
networks have FT parameters which are not a constant, but a
function of the size parameter m. mIBC networks exhibit the best
FT and CD combination of the B-networks considered so far. If
the number of B-elements of $C_m$ is denoted by n, then the fault
tolerance of $C_m$ is approximately $\log_2 n$ and the communication
delay is approximately $2 \log_2 n$.


6. BENES’ REARRANGEABLE NETWORKS 


One of the earliest studies of connecting networks was
performed by Clos on nonblocking switching networks [14].
Clos presented a class of nonblocking networks, now called Clos
networks, consisting of three stages of crossbar switches.
Benes presented a special class of the three-stage Clos network
and showed that it is rearrangeable [6]. A network in this class
has 2 x r input and 2 x r output terminals and consists of only
square crossbar switches. The first and the third stages each
contain r 2 x 2 crossbar switches, or B-elements, while the
middle stage contains two r x r crossbar switches. It has also
been proven that this network can be further decomposed by
replacing each of the r x r crossbar switches in the middle stage
by another three-stage rearrangeable network of the same
structure, as illustrated in Fig. 11. This process can be
continued until all the square crossbar switches in the network
are B-elements. The resultant network is a rearrangeable
$2^m \times 2^m$ B-network consisting of 2m-1 stages of B-elements. We refer
to this B-network as the Benes' rearrangeable network or BRS
network. The fault-tolerance properties of these BRS networks
are the topic of this section.
Fig. 11. Decomposition of Clos's rearrangeable
network as presented by Benes.


The smallest 2 x 2 BRS network is the B-element. The $2^m \times 2^m$
BRS network $B_m$ for $m \ge 2$ is defined recursively as follows.

$$B_m = S_m * \sigma^{-1} * (B_{m-1} + B_{m-1}) * \sigma * S_m,$$

where $S_m$ denotes the $2^m \times 2^m$ B-stack, $\sigma$ denotes the perfect
shuffle permuter, and $B_{m-1}$ denotes the $2^{m-1} \times 2^{m-1}$ BRS network.
Figure 12 depicts the structure of the $2^m \times 2^m$ BRS network. The
$2^3 \times 2^3$ BRS network $B_3$ is illustrated in Fig. 13. The general BRS
network $B_m$ has 2m-1 stages of B-elements and is symmetrical
with respect to the middle stage. By the definition of
rearrangeability, this network is capable of realizing all $(2^m)!$
possible connections. The BRS network is then the most
powerful B-network, in terms of the connecting capability, that
we have considered so far. It is also the most difficult to analyze.

Fig. 12. The general structure of the $2^m \times 2^m$
BRS network $B_m$.
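The recursive definition of the BRS network can be unrolled to count stages and B-elements. The sketch below makes two assumptions that go slightly beyond the text: the shuffle permuters are treated as pure wiring that contributes no B-elements, and stacking two copies of the smaller network doubles the size of each of its stages. The function name `brs_stage_sizes` is hypothetical.

```python
def brs_stage_sizes(m: int) -> list[int]:
    """Number of B-elements in each stage of the BRS network B_m.

    B_1 is a single B-element. B_m places a B-stack of 2^(m-1) elements
    on each side of a stack of two copies of B_(m-1).
    (Hypothetical helper; permuters are assumed to add no B-elements.)
    """
    if m == 1:
        return [1]
    stack = 2 ** (m - 1)
    inner = [2 * size for size in brs_stage_sizes(m - 1)]  # two copies stacked
    return [stack] + inner + [stack]


if __name__ == "__main__":
    for m in range(1, 6):
        sizes = brs_stage_sizes(m)
        print(m, sizes, "total:", sum(sizes))
```

Under these assumptions the network has 2m-1 stages of $2^{m-1}$ B-elements each, so the total count is $(2m-1) 2^{m-1}$; the stage count matches the text, while the total is a derived figure rather than one quoted from the paper.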


BRS networks have been extensively studied by many 
researchers, including Joel [15] and Opferman and Tsao-Wu 
[16]. Previous work has focused on the analysis of the network 
complexity, and implementation of efficient algorithms for 
network control. Opferman and Tsao-Wu have also studied the 
diagnosis of faulty B-elements which are stuck at the T-state or
the X-state. They assume that the state of each individual
B-element is not accessible, and have shown how to derive a 
very small set of connections, or permutations of the terminals, 
which correspond to an efficient set of test patterns. 


As shown in Fig. 12, the $2^m \times 2^m$ BRS network $B_m$ contains a
stack of two $2^{m-1} \times 2^{m-1}$ BRS networks. We denote the upper
one by $B^1_{m-1}$ and the lower one by $B^2_{m-1}$. Let $S_i$ denote the stack
of B-elements in the i-th stage of $B_m$, whose stages are numbered
1,2,...,2m-1 from left to right. Consequently, $S_m$ denotes the
middle stage, and the network $B_m$ is symmetrical with respect to
$S_m$. We call the subnetwork to the left (right) of $S_m$ the left (right)
"half" of $B_m$. Hence the network $B_m$ is the cascade of the left
half, the middle stage $S_m$, and the right half as illustrated in Fig.
13. Wu and Feng designed a full-access B-network called the
baseline network [17] which is isomorphic to the cascade of the
left half of $B_m$ and the middle stage $S_m$.


Since the BRS network is rearrangeable, it clearly has full
access. Hence its terminal delay t is the number of stages in the
network, which is 2m-1. The communication delay parameter d is
much more difficult to derive than the terminal delay.

Fig. 13. The $2^3 \times 2^3$ BRS network, $B_3$.
(Figure labels: Left Half, Middle Stage, Right Half.)


Lemma 5: The CD parameter of the $2^m \times 2^m$ BRS network $B_m$ is
d = 4m-3.

Proof: The baseline network has full access, hence a terminal
link of $B_m$ can reach all the links in the right half of $B_m$ in one
pass. It can reach any link of $B_m$ in a second pass, or within the
distance $d_1 = (2m-1) + (m-1) = 3m-2$.

The middle stage $S_m$ cascaded with the right half of $B_m$ is
isomorphic to the inverse baseline network. It has been shown
that the inverse baseline network is isomorphic to the baseline
network [17], so $S_m$ cascaded with the right half of $B_m$ also
possesses full access. Hence any non-terminal link in the left
half of $B_m$ can reach all the terminal links within the distance
2m-2, and can reach any link of $B_m$ within the total distance
$d_2 = (2m-2) + (2m-2) = 4m-4$. Similarly, any non-terminal link in the
right half of $B_m$ can reach at least two terminal links within the
distance m-1, and hence can reach any link of $B_m$ within
the total distance $d_3 = (m-1) + d_1 = 4m-3$. Since $d_3 > d_2 \ge d_1$, every
link must be able to reach any other link within the distance $d_3$ =
4m-3. Hence the communication delay d of the BRS network $B_m$
is at most $d_3$. If we can show there exist two links separated by
the distance 4m-3, then the communication delay parameter
must be exactly $d_3$ = 4m-3.

Let b denote a B-element in the upper half of the middle stage
$S_m$. Let $k_u$ denote the upper outgoing link of b and $k_l$ denote the
lower incoming link of b. Because of the structure of the BRS
network, an upper outgoing link of a B-element in $S_m$ can only
reach the upper half terminal links of $B_m$ in one pass through the
network. Hence after the first pass, $k_u$ can only reach all the
upper terminal links. Furthermore, in a second pass $k_u$ cannot
reach any of the lower incoming links to the B-elements of $S_m$;
see Fig. 13. Only in a third pass can $k_u$ reach $k_l$. Hence the
distance from $k_u$ to $k_l$ is equal to (m-1) + (2m-1) + (m-1) = 4m-3
= $d_3$. Therefore, the communication delay parameter d of the
$2^m \times 2^m$ BRS network is 4m-3.


Some upper bounds on the fault tolerance parameter k of $B_m$
can be easily obtained. Let the $2^m$ terminals of $B_m$ be numbered
$0, 1, \ldots, 2^m - 1$ from top to bottom. If the top (bottom) B-element in
each stage is set to the T-state, a critical fault consisting of 2m-1
B-elements results, which isolates the terminal 0 ($2^m - 1$). Hence
the fault tolerance parameter k of the $2^m \times 2^m$ BRS network must
be less than or equal to 2m-2.


The $2^m \times 2^m$ BRS network contains two major subnetworks
$B^1_{m-1}$ and $B^2_{m-1}$. Each of these $2^{m-1} \times 2^{m-1}$ BRS networks
contains two $2^{m-2} \times 2^{m-2}$ BRS networks, etc. We call the terminal
links $0, 1, \ldots, 2^{m-1} - 1$ of $B_m$ its upper terminal links, and call the
remaining links $2^{m-1}, \ldots, 2^m - 1$ the lower terminal links. In the left
half of $B_m$, any upper (lower) terminal link can only reach those
links which are upper (lower) input links of the subnetworks of
$B_m$. Consequently, if all $2^{m-1}$ B-elements of the middle stage $S_m$
are set to T, the upper and lower terminal links of $B_m$ are
disconnected, and $B_m$ is decomposed into two identical
subnetworks. Hence another upper bound for k is $2^{m-1} - 1$.
Combining the two upper bounds we can conclude that the FT
parameter k of the $2^m \times 2^m$ BRS network $B_m$ is bounded above
by $\min\{2m-2, 2^{m-1} - 1\}$.
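A short sketch makes it easy to see which of the two upper bounds is the binding one for a given m. The function name `brs_ft_upper_bounds` and the variable names are hypothetical; the two expressions are the ones derived in the text.

```python
def brs_ft_upper_bounds(m: int) -> tuple[int, int, int]:
    """Return (row bound, middle-stage bound, combined bound) on k for B_m.

    row bound:          2m-2, from sticking one B-element per stage at T
    middle-stage bound: 2^(m-1)-1, from sticking the whole middle stage at T
    (Hypothetical helper for illustration only.)
    """
    row_bound = 2 * m - 2
    middle_bound = 2 ** (m - 1) - 1
    return row_bound, middle_bound, min(row_bound, middle_bound)


if __name__ == "__main__":
    for m in range(2, 8):
        print(m, brs_ft_upper_bounds(m))
```

For m = 2 and m = 3 the middle-stage bound is the smaller one; from m = 4 onward the bound 2m-2 takes over, since the middle stage grows exponentially in m.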


It is conjectured that k is actually equal to this upper bound. 


Conjecture 1: The $2^m \times 2^m$ BRS network $B_m$ has FT parameter
$k = \min\{2m-2, 2^{m-1} - 1\}$.

For $m \le 3$ the above conjecture is known to be true. If it is true
for m = 4, then the conjecture can be proven inductively using
an approach similar to that of Theorem 9.


Theorem 10: Let $B_m$ denote the $2^m \times 2^m$ BRS network. The FT
parameter of $B_m$ is $k \le \min\{2m-2, 2^{m-1} - 1\}$. The CD parameter of
$B_m$ is d = 4m-3. The TD parameter of $B_m$ is t = 2m-1.


BRS networks appear to have fault tolerance and
communication delay similar to those of mIBC networks. If the
number of B-elements of $B_m$ is n, then the fault tolerance of $B_m$
is approximately $2 \log_2 n$, and the communication delay is
approximately $4 \log_2 n$. BRS networks appear to be fault tolerant
with respect to full access as well.
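The logarithmic approximations quoted for the mIBC and BRS networks can be compared against the exact parameters with a small sketch. The function names are hypothetical, the BRS element count $(2m-1) 2^{m-1}$ is derived from the stage structure rather than quoted from the text, and the BRS fault tolerance used here is only the conjectured upper bound.

```python
import math


def mibc_params(m: int) -> dict[str, int]:
    """Parameters of the mIBC network C_m as stated in Theorem 9."""
    return {"n": m * 2 ** (m - 1), "k": m - 1, "d": 2 * m - 1, "t": m}


def brs_params(m: int) -> dict[str, int]:
    """Parameters of the BRS network B_m; k is the upper bound of Theorem 10."""
    n = (2 * m - 1) * 2 ** (m - 1)   # assumed: 2m-1 stages of 2^(m-1) elements
    k_upper = min(2 * m - 2, 2 ** (m - 1) - 1)
    return {"n": n, "k": k_upper, "d": 4 * m - 3, "t": 2 * m - 1}


def ratios_to_log_n(params: dict[str, int]) -> tuple[float, float]:
    """Return (k / log2 n, d / log2 n) for one parameter set."""
    log_n = math.log2(params["n"])
    return params["k"] / log_n, params["d"] / log_n


if __name__ == "__main__":
    for m in (4, 8, 16, 32):
        print(m, "mIBC", ratios_to_log_n(mibc_params(m)),
              "BRS", ratios_to_log_n(brs_params(m)))
```

The ratios approach 1 and 2 for the mIBC network and 2 and 4 for the BRS network only slowly as m grows, so the approximations in the text are best read as asymptotic statements.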
A FAULT-TOLERANT CONNECTING NETWORK FOR
MULTIPROCESSOR SYSTEMS
10129 TORINO - ITALY


This paper presents a new interconnection network,
referred to as F-network, which is able to correctly
handle the communications between the connected
devices, even if some nodes within the network
are faulty. The routing algorithm presented
in this paper provides a fast procedure for rerouting
a message; hence, the redundant paths can
also be used to enhance the network bandwidth. It
is also shown that the rerouting properties are
still valid when broadcasting is used. Analytical
models show that the MTBF of an F-network is
unreachable by using several non-fault-tolerant
networks in parallel. Finally, this paper presents
the modularity properties of the F-network, which
lead to an LSI or VLSI implementation cheaper than
for a pair of parallel delta networks.


1. Introduction 


One of the most promising approaches to the implementation
of large multiprocessor systems is based
on the use of special switched networks, connecting
the processors with themselves and/or
with the memory banks. In the past few years, many
papers on interconnection networks have appeared
in the literature [1] - [7]. Almost all
of them deal with the functional properties, the
performance issues and the implementation of the
networks discussed.

Little attention has however been paid to the
fault-tolerance capabilities of interconnection
networks. Fault-tolerance can be achieved in several
ways; the most classical methods include the
use of self-checking and correcting codes for the
data transmitted through the network and the
fault tolerant design of network switches and
control units. These techniques rely directly on
the network implementation. A third way consists
in providing multiple alternative paths for the
transmission of messages; in this case, the network
topology and the routing and re-routing algorithms
greatly influence the network fault-tolerance.

When the latter approach is used to enhance the
fault-tolerance, the following criteria may be used
to judge the network design:

1) the mean life time of the network should be
substantially higher than in the other networks,
and it should be better than that achieved
by duplicating or multiplying other networks;

2) the control algorithm should be as simple as
possible; furthermore, dynamic rerouting should
be used, since it allows simple recovery procedure
and increases the network performances;

3) the network should be able to perform all the
switching functions performed by the commonest
networks, without any penalty;

4) the number of active devices required for implementing
the new network should be kept as low
as possible.
In [12] a multiple path routing scheme for the banyan networks is
presented; however it is not well suited for circuit switching, hence it
partially violates criterion 3. In [8]-[10], the rerouting capabilities of
the ADM and IADM networks are studied; such networks are not able to
reroute every message, hence they violate criterion 2. This paper presents
a new network, referred to as the F-network, which performs well with
respect to all the criteria listed above. Furthermore, it is modular, hence
it is suitable for low-cost LSI or VLSI implementation.
In section 2, the F-network definition is presented and the routing
algorithm is formulated. In section 3, the rerouting capabilities of the
F-network are shown, when broadcast communications are used.

In section 4, analytical reliability models for the F and multiple delta
networks are presented, and the results obtained are discussed. In section
5, the modularity properties of the F-network are illustrated and their
impact on the implementation costs is discussed.


2. F-network definition 


The basic element of the network presented in this paper is a switch with 4
inputs, 4 outputs and capacity 1. A block diagram of a single switch is
shown in Fig.1. Although this diagram can be used as a suggestion for
implementation, it is provided here only to explain the switch behaviour
clearly.


Fig.1. Block diagram of the switch used.

Let N be the number of input and output devices and n be equal to
$\log_2 N$. Then the whole F network is constituted by n+1 stages composed
of N nodes. While each node in a middle stage is a full switch, the nodes
located in the stages 0 and n are constituted by only the right or left
half of a full switch. Hence, the overall complexity is $N \log_2 N$ times
the complexity of the switch shown in Fig.1.

The nodes within each stage are numbered from 0 to N-1 and from top to
bottom, while the stages are numbered from 0 to n and from left to right.
The input devices are connected to the nodes in stage 0 and the output ones
are connected to the nodes in stage n. The two sets of input and output
devices may or may not be coincident. In the rest of this paper, a node of
the network will be referred to as $P_j$, where P ($0 \le P < N$) indicates
the number within the stage and j ($0 \le j \le n$) indicates the stage
number.

One vector of bits $(p_{j,n-1},\ldots,p_{j,0})$ can be associated to each
node; each vector is calculated so that the following relation holds:

$$P = \sum_{i=0}^{n-1} p_{j,i}\,2^i \tag{1}$$

In other words, $(p_{j,n-1},\ldots,p_{j,0})$ is the representation of the
number P in the binary number system. The interconnections between the
nodes in the F-network are defined using the following rules.

Definition 2.1.

The topology of an F network with N input and output devices can be
obtained by connecting the four outputs of a generic node $P_j$
($0 \le j < n$) to the nodes $P_{j+1}$, $Q_{j+1}$, $R_{j+1}$, $S_{j+1}$,
where the numbers of these nodes are expressed by the following strings of
bits:

$$
\begin{aligned}
P_{j+1} &= (p_{j,n-1},\ldots,p_{j,j+1},\,p_{j,j},\,p_{j,j-1},\ldots,p_{j,0}) \\
Q_{j+1} &= (p_{j,n-1},\ldots,p_{j,j+1},\,\bar{p}_{j,j},\,p_{j,j-1},\ldots,p_{j,0}) \\
R_{j+1} &= (\bar{p}_{j,n-1},\ldots,\bar{p}_{j,j+1},\,p_{j,j},\,p_{j,j-1},\ldots,p_{j,0}) \\
S_{j+1} &= (\bar{p}_{j,n-1},\ldots,\bar{p}_{j,j+1},\,\bar{p}_{j,j},\,p_{j,j-1},\ldots,p_{j,0})
\end{aligned}
\tag{2}
$$

It is worth noting that the F network is a superset of the binary cube
network; in fact, the former can emulate the latter, by using only the
connections to $P_{j+1}$ and $Q_{j+1}$. The F network with N=8 is shown in
Fig.2.

Although the network topology seems very complicated, the routing algorithm
is very simple. In fact the F network belongs to the "digit controlled"
class of networks [4]; that is, the routing at each stage is performed only
on the basis of a single digit within a routing tag. In our case, since the
message entering a node can be routed to one out of four outputs, the
routing tag $T=(t_{n-1},\ldots,t_0)$ is composed of n four-valued digits
$t_j$ ($0 \le j < n$). The four possible values of $t_j$ are 0, 1, 2, 3;
the choice of the set of the values for $t_j$ is merely conventional. The
path-finding process of a message is based on the use of a special function
f defined below.

Definition 2.2.

The function $f(P_j, t_j)$ accepts a string of bits P and a four-valued
digit $t_j$ and produces a string of bits (i.e. a number) defined by the
following relation.

$$
f(P_j, t_j) =
\begin{cases}
P_{j+1} & t_j = 0 \\
Q_{j+1} & t_j = 1 \\
R_{j+1} & t_j = 2 \\
S_{j+1} & t_j = 3
\end{cases}
\tag{3}
$$

Fig.2. The 8x8 F-network.

Once the right routing tag has been calculated, a request for the output
$D=(d_{n-1},\ldots,d_0)$ generated by the input $S=(s_{n-1},\ldots,s_0)$ is
routed according to the following recursive procedure:

$$
\begin{aligned}
M_0 &= S \\
M_{j+1} &= f(M_j, t_j), \qquad 0 \le j < n \\
D &= M_n
\end{aligned}
\tag{4}
$$

where $M_j$ ($0 \le j \le n$) is the node at the stage j involved in the
path between S and D.

The most important feature of the F-network is that it gives $2^n$ possible
destinations, while $4^n = 2^{2n}$ different routing tags are allowed. From
the pathfinding procedure defined by (4), it can be established at once
that the sequence of nodes $M_j$ composing a path is altered, even if only
one digit composing the related routing tag is altered. Hence different
paths correspond to different routing tags.

Furthermore, since the number of the possible routing tags is larger than
the number of the possible destinations, several paths exist in an
F-network, connecting an input to an output node. For example, if a message
for the output 6 is generated at the input node 4, in the network shown in
Fig.2, it can be routed by using one of the following tags: (0,1,0),
(1,0,2), (0,2,2), (1,3,0).
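
The example above can be checked mechanically. The following Python sketch
assumes one particular reading of the function f used in procedure (4): at
stage j, digits 1 and 3 complement bit j of the node number, and digits 2
and 3 complement all the bits above j. This reading is an assumption,
chosen because it reproduces the four tags quoted above; the names `f` and
`route` are illustrative only.

```python
def f(node: int, digit: int, stage: int, n: int) -> int:
    """Assumed reading of f(P, t_j) for a node sitting at stage `stage`."""
    bit_j = 1 << stage
    above_j = ((1 << n) - 1) & ~((bit_j << 1) - 1)  # mask for bits j+1 .. n-1
    if digit & 1:          # digits 1 and 3 complement bit j
        node ^= bit_j
    if digit & 2:          # digits 2 and 3 complement the bits above j
        node ^= above_j
    return node


def route(source: int, tag: tuple, n: int) -> list:
    """Apply procedure (4); `tag` is written as (t_{n-1}, ..., t_0)."""
    path = [source]
    for stage in range(n):
        digit = tag[n - 1 - stage]
        path.append(f(path[-1], digit, stage, n))
    return path


tags = [(0, 1, 0), (1, 0, 2), (0, 2, 2), (1, 3, 0)]
paths = [route(4, tag, 3) for tag in tags]
for tag, path in zip(tags, paths):
    print(tag, path)
```

Under this reading the four tags give four different node sequences, all
starting at node 4 and ending at node 6.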

The following theorem provides us with two important results. It shows how
a message can be routed through the network and how the redundant paths can
be used to circumvent faulty nodes. Moreover, the algorithm presented needs
a routing tag representation using only n+1 binary digits rather than 2n.


Theorem 2.1. Let $C = (c_{n-1},\ldots,c_0) = S \oplus D$, where the symbol
$\oplus$ indicates the bit-wise exclusive-or operation performed on the two
vectors of bits S and D. Let r be a binary variable initially set to 0 and
$r^{(j)}$ be the value of r after the completion of the j-th step of the
routing algorithm described by the following recursive procedure:

$$M_{j+1} = f(M_j,\, c_j \oplus r^{(j)}), \qquad r^{(j+1)} = r^{(j)} \tag{5}$$

or, alternatively

$$M_{j+1} = f(M_j,\, 2 + (c_j \oplus r^{(j)})), \qquad r^{(j+1)} = \bar{r}^{(j)} \tag{6}$$

with $M_0 = S$; then $M_n = D$.

Proof

In order to prove the thesis of the theorem, it is sufficient to show that
the j-th bit of $M_{j+1}$ is equal to $d_j$ and that the bits from 0 to j-1
are equal in $M_{j+1}$ and $M_j$. In fact, once such a result is proven,
the k-th bit of $M_n$ is equal to the k-th bit of $M_{k+1}$, that is $d_k$;
hence $M_n = D$.

In general, when the routing algorithm reaches the step j a certain number
of routing steps making use of the equation (6) have already been executed.
The two cases of an even and an odd number of such steps are considered
separately:

a) even: $r^{(j)}=0$ and $m_{j,j}=s_j$, since, from (6), an even number of
   complementations have been performed on these bits and their initial
   values were 0 and $s_j$, respectively. From Definition 2.1 we derive
   that the following relation holds:

   $$m_{j+1,j} = m_{j,j} \oplus (r^{(j)} \oplus c_j) = s_j \oplus c_j \tag{7}$$

   Since $c_j$ was computed as $s_j \oplus d_j$, (7) becomes

   $$m_{j+1,j} = s_j \oplus s_j \oplus d_j = d_j \tag{8}$$

b) odd: $r^{(j)}=1$ and $m_{j,j}=\bar{s}_j$, since, from (6), an odd number
   of complementations have been performed on these bits, and their initial
   values were 0 and $s_j$, respectively. Using the same arguments as in
   the case a, it is possible to show that the following relation is valid:

   $$m_{j+1,j} = d_j \tag{9}$$

In both cases the bits from the 0-th to the (j-1)-th are copied from $M_j$
into $M_{j+1}$.
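
The statement of Theorem 2.1 can also be checked by brute force on a small
network. The sketch below is only an illustration under an assumed reading
of f (digits 1 and 3 complement bit j, digits 2 and 3 complement the bits
above j); `route_by_theorem` is a hypothetical helper that applies step (5)
or step (6) at each stage according to a caller-supplied pattern.

```python
from itertools import product


def f(node: int, digit: int, stage: int, n: int) -> int:
    """Assumed reading of f(P, t_j) for a node sitting at stage `stage`."""
    bit_j = 1 << stage
    above_j = ((1 << n) - 1) & ~((bit_j << 1) - 1)  # mask for bits j+1 .. n-1
    if digit & 1:
        node ^= bit_j
    if digit & 2:
        node ^= above_j
    return node


def route_by_theorem(s: int, d: int, use_eq6, n: int) -> list:
    """Route s -> d; use_eq6[j] selects step (6) instead of step (5)."""
    c = s ^ d                      # C = S xor D
    r = 0                          # r is initially 0
    node = s
    path = [node]
    for j in range(n):
        digit = ((c >> j) & 1) ^ r
        if use_eq6[j]:
            digit += 2             # step (6): digit 2 + (c_j xor r)
            r ^= 1                 # ... and r is complemented
        node = f(node, digit, j, n)
        path.append(node)
    return path


n = 4
all_reach = all(
    route_by_theorem(s, d, pattern, n)[-1] == d
    for s in range(1 << n)
    for d in range(1 << n)
    for pattern in product((False, True), repeat=n)
)
distinct_paths = {
    tuple(route_by_theorem(4, 6, pattern, 3))
    for pattern in product((False, True), repeat=3)
}
print(all_reach, len(distinct_paths))
```

Every mix of the two step types reaches the requested destination, and for
the 8x8 example the eight patterns collapse to four distinct node
sequences, because the choice made at the last stage does not change the
node reached.
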
At each step of the routing algorithm illustrated in Theorem 2.1, it is
possible to calculate the next node of the path, using one out of two
possible formulas. In other words, there are always at least two different
paths starting from a node within the F-network and leading to the same
destination node. Only when the message reaches a node in stage n-1 do the
alternative paths lead to the same destination node, differing merely from
a formal point of view. In fact, while either equation (5) or equation (6)
can be used, the result is always D. This feature derives from the
assumption that a different device is connected at each node in stage n.

Theorem 2.1 assumes that the routing is performed on the basis of a binary
variable r and a vector of bits C, by applying a sequence composed of a mix
of two types of steps. By using all the possible patterns of steps, all the
paths interconnecting the same input-output pair are generated. Given an
input node, there are $2^n$ possible tags leading to the same output node;
however, only $2^{n-1}$ distinct paths exist, since the tags differing only
in $t_{n-1}$ produce identical paths. Since $2^n$ possible destinations
exist, the paths starting from an input node are equally distributed among
all the possible destinations.

Furthermore, the F-network is designed so that alternative paths exist at
each stage. This feature allows an on-the-fly rerouting of a message, when
some nodes in a network are faulty. In fact, if at step j of the routing
algorithm, the next node selected by using equation (5) is faulty, the
message can be routed to the node selected by using equation (6), and vice
versa. The on-the-fly re-routing can be usefully employed to enhance the
network bandwidth, since the nodes previously acquired need not be
released, and the re-routing is accomplished in a short time interval.
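
The on-the-fly rerouting rule can be sketched as a greedy procedure: at
each stage try the node given by equation (5), and fall back to the node
given by equation (6) if the first one is faulty. The code below is a
minimal illustration, not the authors' control logic; it assumes the same
reading of f as above (digits 1 and 3 complement bit j, digits 2 and 3
complement the bits above j), and the representation of faults as
`(stage, node)` pairs is hypothetical.

```python
def f(node: int, digit: int, stage: int, n: int) -> int:
    """Assumed reading of f(P, t_j) for a node sitting at stage `stage`."""
    bit_j = 1 << stage
    above_j = ((1 << n) - 1) & ~((bit_j << 1) - 1)  # mask for bits j+1 .. n-1
    if digit & 1:
        node ^= bit_j
    if digit & 2:
        node ^= above_j
    return node


def route_avoiding(s: int, d: int, faulty: set, n: int):
    """Greedy rerouting; returns the node sequence or None if blocked."""
    c, r, node = s ^ d, 0, s
    path = [node]
    for j in range(n):
        digit = ((c >> j) & 1) ^ r
        via_5 = f(node, digit, j, n)        # next node according to (5)
        via_6 = f(node, digit + 2, j, n)    # next node according to (6)
        if (j + 1, via_5) not in faulty:
            node = via_5
        elif (j + 1, via_6) not in faulty:
            node, r = via_6, r ^ 1          # (6) also complements r
        else:
            return None                     # both successors are faulty
        path.append(node)
    return path


print(route_avoiding(4, 6, set(), 3))
print(route_avoiding(4, 6, {(1, 4)}, 3))
print(route_avoiding(4, 6, {(1, 4), (1, 2)}, 3))
```

No node already acquired is released: the decision is purely local to the
current stage, which is why the detour costs no backtracking.
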
3. Broadcasting 


Broadcasting capability is an important issue for a connecting network,
since the algorithms executed on multiprocessor systems often require that
the result of some computation should be sent to a pool of processors
and/or memory banks. Broadcasting in the F-network is discussed in this
section.

In general, in an interconnection network, the path for a
multi-destination message is established by duplicating, on different
output links of some switch, either the arriving packet (packet switching)
or the request-to-connection (circuit switching). The F-network also works
in this way; furthermore, since more than two outputs per node are
available, broadcasting in the network presented here has the same
rerouting properties as the point-to-point transmission.

The routing of a multi-destination message is performed on the basis of the
routing tag C, the bit r, defined in section 2, and an additional n bit
broadcasting mask $B=(b_{n-1},\ldots,b_0)$, used to identify the switches
which should duplicate the arriving message on their outputs. In fact, a
node in stage j, finding $b_j = 1$, duplicates the message, while, if
$b_j = 0$, the node behaves as in a point-to-point connection. The
duplication of a message on two different output links of a node in stage j
can be seen as the superimposing of two point-to-point connections, one
with $c_j=0$ and the other with $c_j=1$. Hence, by applying formulas (5)
and (6), either of the following nodes in the stage j+1 can be selected for
the copy of the message corresponding to $c_j=0$:

$$
\begin{aligned}
M_{j+1} &= f(M_j,\, 0 \oplus r^{(j)}), & r^{(j+1)} &= r^{(j)} \\
M_{j+1} &= f(M_j,\, 2 + (0 \oplus r^{(j)})), & r^{(j+1)} &= \bar{r}^{(j)}
\end{aligned}
\tag{10}
$$

while, for the copy of the message corresponding to $c_j=1$, either of the
following nodes can be used:

$$
\begin{aligned}
M_{j+1} &= f(M_j,\, 1 \oplus r^{(j)}), & r^{(j+1)} &= r^{(j)} \\
M_{j+1} &= f(M_j,\, 2 + (1 \oplus r^{(j)})), & r^{(j+1)} &= \bar{r}^{(j)}
\end{aligned}
\tag{11}
$$

Thus, in the F-network, two alternative paths exist for each copy of a
message, which is duplicated at a node to allow broadcasting transmission.
Note that the two copies of a message may be routed independently, hence in
general the value of $r^{(j+1)}$ is not equal for both copies.

It is easy to prove that if a message is duplicated only once, in stage j,
it will reach only two outputs, whose numbers differ in only the j-th bit.
When many duplications of the same message occur, the final result obtained
is equivalent to the superposition of the results of each single message
duplication. Hence, the following theorem may be easily proven.


Theorem 3.1

A source node $S=(s_{n-1},\ldots,s_0)$ can broadcast to the $2^l$
destination nodes $D=(d_{n-1},\ldots,d_0)$ whose numbers are obtained by
taking all the $2^l$ combinations for the bits $d_{k_l},\ldots,d_{k_1}$ and
fixing a value for the other n-l bits. The complete routing tag should be
computed as follows:

$$
\begin{aligned}
r^{(0)} &= 0 \\
C &= D \oplus S \\
B &= (b_{n-1},\ldots,b_0)
\end{aligned}
$$

where

$$
b_j =
\begin{cases}
1 & j = k_1, k_2, \ldots, k_l \\
0 & \text{otherwise}
\end{cases}
$$

Theorem 3.1 states that the cardinality of the set of the destination nodes
should always be a power of two. In effect, it is possible to apply to the
F-network a broadcasting scheme presented in [10] for IADM and ADM
networks; in addition, it is possible to eliminate the constraint, imposed
in the original presentation of this method, on the contiguity of the
stages duplicating the message.

The first step is to compute the difference between the number of
destinations and the largest power of 2 less than that number. The binary
representation of such a difference can be embedded into C, by using the
bits $c_{k_l},\ldots,c_{k_1}$, which are not used by the stages duplicating
the message, as shown by the previous discussion. (C.AND.B) is a vector of
bits, where the bits in position $k_{i-1},\ldots,k_1$ express the current
value of the count, while the others are meaningless and set to 0. When a
decrement of $2^i$ is performed on the (C.AND.B) number, the bits of the
result in position $k_{i-1},\ldots,k_1$ express the value of the input
count decremented by $2^i$, while the other bits are still meaningless. In
order to obtain the correct output value of the whole vector C, the bits of
the result $(x_{n-1},\ldots,x_0)$, obtained by decrementing (C.AND.B), and
the bits of C must be merged according to the following rule:

$$
c_j =
\begin{cases}
x_j & j = k_1, k_2, \ldots, k_{i-1} \\
c_j & \text{otherwise}
\end{cases}
\tag{12}
$$

This decrement and merge operation is used by the routing algorithm
performed by the control units of the nodes which should duplicate the
arriving message. This algorithm is described by the procedure shown in
Fig. 3.


if b,=0 

pee! 
then if (R.AND.B) and b,=0 Vjz>i 
eae aaa 


then do not duplicate the message and 
send the single copy as if C, + (i) 
substract 27 from (R.AND.B) 

set the counter to O if the result 
1s negative merge the result of the 
substraction and C 


else 


af count#0 


route the copy with. modified 
C as if Cj @ rfid 
c,@r i) =o; 

1 


Epen 
-slas if 


do not duplicate the message 
and send the single copy as 
if c. @r(i) =o; 


else 


Fig.3. Procedure executed by a control unit of a 
switch which should duplicate the arriving 
message. 


Note that the binary broadcast subtrees pruned are always those leading to
the highest numbered outputs.

Finally, it should be noted that the F-network allows a message to be
duplicated and both copies to be rerouted, at the same node, even when the
algorithm described in Fig.3 is applied. In fact, the bits of C are never
changed by the rerouting procedure, hence the bits of C.AND.B entering a
control unit are always correct.

In conclusion, the F-network is able to perform the most sophisticated
broadcasting techniques allowed by other similar networks; in addition, it
is able to combine such broadcasting properties with the dynamic rerouting
capabilities discussed in section 2.


4. Reliability modelling


The primary goal of the F-network design was to obtain an interconnection
network able to correctly handle the communications between its input and
output devices, even if some nodes are faulty. The final result expected is
the enhancement of the network reliability. The results presented in this
section give an estimation of the reliability enhancement achieved. The
analysis is based on the following assumptions:

a) the faults occur, independently, only within the network nodes;
b) the nodes in the stages 0 and n are considered fault-free;
c) each kind of fault prevents the correct execution of any node operation,
   hence a faulty node is totally unavailable;
d) the whole system is considered faulty, when the number and the location
   of the faulty nodes prevent the communications between at least one
   input-output pair.
Hypothesis a is obvious, since only network reliability should be studied.
Hypothesis b derives from the assumption that each input or output device
communicates with the network by only one port. In this case, the failure
of the first switch connected to one of these ports prevents every
communication with the corresponding device. Hence, the faults within the
nodes in stages 0 and n are not recoverable by using a suitable network
topology, but their effect should be avoided only by the reliable
implementation of such switches.

Hypothesis c leads to a conservative analysis, since, in general, a fault
does not destroy all node functionalities. Finally, under hypothesis d, the
occurrence of non-critical faults does not prevent any system operation.


Theorem 4.1.

The minimum number of faults leading to a system failure is 2.

The proof derives directly from the routing algorithm, which allows two
alternative paths at each stage.

Theorem 4.2.

The maximum number of faults possible without causing a system failure is
$\frac{N}{2}((\log_2 N)-1)$, where N is the number of input and output
devices.

Proof

From the proof of Theorem 2.1, it follows that all the messages requiring
the use of the node $(p_{n-1},\ldots,p_j,p_{j-1},\ldots,p_0)$, at the stage
j, may be rerouted only to the node
$(\bar{p}_{n-1},\ldots,\bar{p}_j,p_{j-1},\ldots,p_0)$ and vice versa.
Hence, the N nodes within a stage can be divided into N/2 subsets of 2
nodes, referred to as B subsets. Since the network does not fail until both
nodes in any subset are faulty, and only the $(\log_2 N)-1$ middle stages
can contain faulty nodes, the maximum number of faults in a non-faulty
network is $\frac{N}{2}((\log_2 N)-1)$.


Theorem 4.1 and Theorem 4.2 provide the exact lower and upper bounds on the
number of faults which cause a system failure. However, it can easily be
realized that both the best and the worst cases occur only when some
particular pattern of faults is found. Hence, the network reliability
characterization provided by the results of the previous theorems is too
poor. Since the fault location is random, the number of faults causing the
system failure is a random variable. Thus, it is important to evaluate its
mean value, k. The general expression for k is the following one:

$$k = \sum_{i=2}^{L+1} i\,P(i), \qquad L = \frac{N}{2}((\log_2 N)-1) \tag{13}$$

where P(i) is defined as follows:

P(i) = Pr{the i-th fault causes the system failure}

This probability can also be expressed by the following formula:

$$P(i) = Q(i-1)\,R(i) \tag{14}$$

where:

Q(i-1) = Pr{i-1 faults do not cause the system failure}

and

R(i) = Pr{a fault causes the system failure | i-1 faults have already
occurred and the system is not faulty}


In a NxN F-network, L B subsets exist, and the whole system is not faulty
until both nodes in the same B subset are faulty. Hence, the fault patterns
preserving the system functionalities are constituted by nodes belonging to
different B subsets. The number of the groups of i-1 different subsets is
$\binom{L}{i-1}$; $2^{i-1}$ fault patterns correspond to each group,
because it is possible to choose independently for each subset the faulty
node. Since all the fault patterns are equally probable, Q(i-1) is given by
the following expression:

$$Q(i-1) = \frac{\binom{L}{i-1}\,2^{i-1}}{\binom{2L}{i-1}} \tag{15}$$

Given a non faulty F-network with i-1 faults, the i-th failure of a node
causes a system failure if and only if the new faulty node belongs to the
same B subset as one of the previously failed nodes. Since such nodes
belong to i-1 different subsets, there are i-1 nodes out of 2L-(i-1), the
failure of which will cause a system failure. Hence, R(i) can be expressed
as follows:

$$R(i) = \frac{i-1}{2L-(i-1)} \tag{16}$$

Note that Q(1)=1 and R(1)=0, as required by Theorem 4.1, and Q(L+1)=0 and
R(L+1)=1, as required by Theorem 4.2. By using the equations (13), (14),
(15) and (16), it is possible to compute k.
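
The computation of k is easy to reproduce numerically. The sketch below
evaluates the expressions with exact rational arithmetic, reading (15) as
Q(i) = C(L,i) 2^i / C(2L,i) and (16) as R(i) = (i-1)/(2L-(i-1)), and
summing (13) from i=2 to i=L+1. These readings, and the function names, are
assumptions made for the illustration; the code also checks that the
probabilities P(i) add up to one.

```python
from fractions import Fraction
from math import comb


def b_subsets(N: int) -> int:
    """L = (N/2)(log2 N - 1) for N a power of two."""
    n = N.bit_length() - 1
    return (N // 2) * (n - 1)


def Q(i: int, L: int) -> Fraction:
    """Probability that i faults hit i different B subsets."""
    return Fraction(comb(L, i) * 2 ** i, comb(2 * L, i))


def R(i: int, L: int) -> Fraction:
    """Probability that the i-th fault completes a B subset."""
    return Fraction(i - 1, 2 * L - (i - 1))


def P(i: int, L: int) -> Fraction:
    """Probability that the i-th fault is the one causing the failure."""
    return Q(i - 1, L) * R(i, L)


def mean_faults(N: int) -> Fraction:
    """Mean number of faults leading to the system failure."""
    L = b_subsets(N)
    return sum(i * P(i, L) for i in range(2, L + 2))


for N in (4, 8, 16, 32):
    L = b_subsets(N)
    total = sum(P(i, L) for i in range(2, L + 2))
    print(N, L, float(mean_faults(N)), total)
```

For the smallest case N=4 there are two B subsets, P(2)=1/3 and P(3)=2/3,
so the mean is 8/3.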

Fig.4 shows the value of k for different network sizes. It is worth noting
that for Delta, IADM and ADM networks k is always 1. In fact, the Delta
networks provide only one path between an input-output pair, hence the
failure of a single node will cause a system failure.

The IADM and ADM networks in general provide multiple paths between an
input device and an output one; however, for some input-output pairs, there
is only one path. Since each node within such networks is involved in at
least one of these unique paths, a single failure will cause a system
failure.

Let us compare the reliability of the F-network with that of a system of h
parallel 2x2 delta networks like that shown in Fig.5, where a network is
switched off as soon as one of its nodes fails.

For the sake of uniformity, it will be assumed that each delta network is
implemented replacing each 2x2 crossbar switch with a fully connected
bipartite graph with 4 nodes, each one constituted by a switch like that
shown in Fig.1, with 2 inputs and 2 outputs. Furthermore, it will be
assumed that the time to failure of each node is a random variable with
negative exponential distribution and mean value $\lambda^{-1}$. Since each
node of the F-network is twice as complex as that of a delta network, it is
assumed that the failure rate for a switch of the F-network is $2\lambda$.

Fig.4. Average number of faults leading to the system failure for
different network sizes.

Fig.5. Redundant interconnection system composed by h parallel networks.

At this point, it is possible to compute the mean time before the failure
(MTBF) for the network proposed here and the configuration shown in Fig.5.
The MTBF of an F-network, $\mathrm{MTBF}_F$, can be computed as follows:

$$\mathrm{MTBF}_F = \frac{1}{2\lambda}\sum_{i=2}^{L+1}\left(\sum_{j=0}^{i-1}\frac{1}{2L-j}\right)P(i) \tag{17}$$

For a single delta network, the failure is caused by a single node failure,
because there is only one path between each input-output pair. Since the
network has identical nodes, the MTBF is given by the following formula:

$$\mathrm{MTBF}_\Delta = (\lambda N \log_2 N)^{-1} \tag{18}$$

It is worth noting that the life time of a delta network also has a
negative exponential distribution. Hence, it is easy to compute by using
formulas of classic reliability modelling the MTBF for a system with h
identical parallel networks where each single network is switched off when
one of its nodes fails. The final result is given by the following
expression:

$$\mathrm{MTBF}_{h\Delta} = \Lambda^{-1}\sum_{i=1}^{h}\frac{1}{i}, \qquad \Lambda^{-1} = (\lambda N \log_2 N)^{-1} \tag{19}$$


The plots of the MTBF for the F-network and for a system with h parallel
delta networks are shown in Fig.6.

It can be seen that the parallel delta networks achieve the same MTBF as
the F only when the size of the network is small, for realistic values of
h. However, interconnection networks are intended for very large
multiprocessor systems, hence the range of interest is shown on the right
side of Fig.6.


After simple calculation, it can be seen that for N=32, more than 1000
delta networks are needed to achieve the same MTBF as a single F network;
this number goes up to above $10^6$, when N=128 is considered.

Fig.6. MTBF for the F and parallel delta networks.

In other words, the previous analysis shows that the MTBF attained by a
single F network cannot be reached by a system where the redundancy is
obtained by using several delta networks in parallel.
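
The comparison can be reproduced with a few lines of code. The sketch below
is an illustration under explicitly assumed forms of the MTBF expressions:
for the F-network, 2L nodes each failing at rate 2λ, so that the mean time
to the i-th fault is the sum of 1/((2L-j)2λ) for j from 0 to i-1, weighted
by P(i); for h parallel delta networks, each network failing at rate
λ N log2 N, so that the system MTBF is the h-th harmonic number divided by
that rate. The function names and the default λ=1 are choices made for the
example.

```python
from fractions import Fraction
from math import comb


def fault_pmf(L: int) -> dict:
    """P(i) = Q(i-1) R(i) for i = 2 .. L+1."""
    pmf = {}
    for i in range(2, L + 2):
        q = Fraction(comb(L, i - 1) * 2 ** (i - 1), comb(2 * L, i - 1))
        r = Fraction(i - 1, 2 * L - (i - 1))
        pmf[i] = q * r
    return pmf


def mtbf_f(N: int, lam=1):
    """Assumed MTBF of an N x N F-network (node failure rate 2*lam)."""
    L = (N // 2) * (N.bit_length() - 2)        # (N/2)(log2 N - 1)
    pmf = fault_pmf(L)
    total, wait = Fraction(0), Fraction(0)
    for i in range(1, L + 2):
        wait += Fraction(1, 2 * L - (i - 1))   # mean gap before the i-th fault
        total += wait * pmf.get(i, 0)
    return total / (2 * lam)


def mtbf_parallel_delta(N: int, h: int, lam=1):
    """Assumed MTBF of h parallel delta networks (rate lam*N*log2 N each)."""
    n = N.bit_length() - 1
    harmonic = sum(Fraction(1, i) for i in range(1, h + 1))
    return harmonic / (lam * N * n)


for N in (4, 8, 16, 32):
    print(N, float(mtbf_f(N)), float(mtbf_parallel_delta(N, 2)))
```

Because the harmonic number grows only logarithmically with h, adding
parallel delta networks raises the MTBF very slowly, which is the effect
described above.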


5. Network modularity 


Previous work on the LSI and VLSI implementation of interconnecting
networks [12]-[14] has shown that switches belonging to different stages
must be integrated in a single circuit, in order to obtain the minimum chip
count for implementing a given network. Unfortunately, the use of basic
blocks composed by switches of different stages imposes some constraints on
the network topology.

Hence it is not always possible to define such basic building blocks, for
each interconnecting network. The modularity issues of the F-networks are
discussed in this section. The goal is to show how an F-network of a given
size can be built, interconnecting several smaller multistage subnetworks.
The most important feature of these subnetworks should be the limited
number of interconnections. In fact, the smaller the number of input and
output signals, the smaller the number of pins required for implementing
each subnetwork in a single chip. Since the pin count rather than the area
is the main limiting factor for the integration of large subnetworks, the
basic building block with the minimum number of interconnections gives us
the best VLSI implementation.


Definition 5.1

A SUBF network with $M=2^m$ ($m \ge 2$) inputs, referred to as SUBF(M), is
a network obtained from an M input F-network, by using switches with 4
inputs and 2 outputs in the last stage.

From Definition 5.1 it follows that a SUBF(M) has 2M output links.

Each output node has 2 outlets, which will be distinguished by referring to
them as the "dashed" link and the "solid" link, with an obvious reference
to their representation in the figures of this paper. Hence, each SUBF(M)
has M "solid" and M "dashed" outlets.

In order to allow the connection of several SUBF(M), it is assumed that the
implementation of the output nodes of such networks is in accordance with
the block diagram shown in Fig.7. The selection of the "dashed" or the
"solid" link is performed by enabling the appropriate three-state buffer;
in this way, several output links of a SUBF network can be tied to the same
input link of a different SUBF network, without extra logic.

A NxN F-network can be obtained by using $(N/M)\log_M N$ SUBF(M) networks,
arranged in $\log_M N$ stages of N/M subnetworks. The first stage should
perform all the routing functions of the first m stages of the F-network.
From the routing algorithm presented in section 2, it can be deduced that a
message entering the F-network from input $P=(p_{n-1},\ldots,p_0)$ can
reach either node in stage j, whose number is expressed by
$(p_{n-1},\ldots,p_j,x,\ldots,x)$ or by
$(\bar{p}_{n-1},\ldots,\bar{p}_j,x,\ldots,x)$, where a string of x stands
for any binary string of the same length.

Hence, in order to preserve such a behavior for $0 \le j \le m-1$ it is
necessary for each SUBF(M) to group the inputs expressed by
$(p_{n-1},\ldots,p_{m-1},x,\ldots,x)$ and by
$(\bar{p}_{n-1},\ldots,\bar{p}_{m-1},x,\ldots,x)$; varying the string
$p_{n-1},\ldots,p_{m-1}$ all the N/M pairs of groups are generated.
Moreover, since a message occupying the node
$(p_{n-1},\ldots,p_{m-1},d_{m-2},\ldots,d_0)$ at stage m-1 can reach either
node $(p_{n-1},\ldots,p_m,d_{m-1},d_{m-2},\ldots,d_0)$ or node
$(\bar{p}_{n-1},\ldots,\bar{p}_m,d_{m-1},d_{m-2},\ldots,d_0)$ at stage m,
it is necessary to use the two output links to reach all the possible
nodes. In particular, the "dashed" link of the output
$(p_{n-1},\ldots,p_m,d_{m-1},d_{m-2},\ldots,d_0)$ is connected to the
"solid" link of the output
$(\bar{p}_{n-1},\ldots,\bar{p}_m,d_{m-1},d_{m-2},\ldots,d_0)$. In this way,
the choice of the value of $d_{m-1}$ is performed at the last stage of the
SUBF(M), while the adjustment of the most significant n-m bits is performed
by choosing either the "solid" or the "dashed" link.

Fig.7. Output switch for a SUBF network.

The second stage of SUBF networks is able to perform the same operation as
the first one; however, since it should operate on the second block of m
bits, there is an m-unshuffle permutation, which causes an m-position right
rotation of the bits expressing the node number. Each subnetwork groups the
inputs expressed by
$(d_{m-1},\ldots,d_0,p_{n-1},\ldots,p_{2m-1},x,\ldots,x)$ and by
$(d_{m-1},\ldots,d_0,\bar{p}_{n-1},\ldots,\bar{p}_{2m-1},x,\ldots,x)$.
Since the most significant m bits cannot be changed, the connection between
"solid" and "dashed" links at the output of the second stage must not
influence such bits.

Repeating the same process shown above and taking account of the increasing
number of most significant bits which cannot be changed, all the
$\log_M N$ stages can be laid out. A final m-unshuffle permutation is
required to obtain the correct ordering of the outputs. A 16x16 F-network
built using eight SUBF(4) is shown in Fig.8.

Each SUBF(M) allowing the transmission of w parallel signals per input
requires 3wM connections with the outside world, whereas each MxM
subnetwork proposed in [12] for a class of delta networks requires 2wM
connections.

Fig.8. 16x16 F-network built using 8 SUBF(4) networks.

Let us consider the problem of implementing a fault-tolerant network with N
inputs and N outputs and B parallel signals per input. Two alternatives
with similar costs are considered: the use of a NxN F-network, and the use
of a pair of NxN modular delta networks in parallel.

In the first case, the number of modules to be used, $b_F$, is given by:

$$b_F = \frac{B}{w}\,\frac{N}{M_F}\log_{M_F} N = \frac{B}{w}\,\frac{N \log_2 N}{M_F \log_2 M_F} \tag{20}$$

while in the second case, the number of the subnetworks, $b_\Delta$, is
expressed by

$$b_\Delta = \frac{B}{w}\,\frac{2N}{M_\Delta}\log_{M_\Delta} N = \frac{B}{w}\,\frac{2N \log_2 N}{M_\Delta \log_2 M_\Delta} \tag{21}$$

The value of M is limited by the number of pins per package allowed; such a
limitation does not depend on the type of the network. Hence, if the
limited number of pins imposes a maximum of $Z_0$ connections per
subnetwork, the following expressions must hold:

$$M_F \le \frac{Z_0}{3w} \tag{22}$$

$$M_\Delta \le \frac{Z_0}{2w} \tag{23}$$

Assuming $M_F = Z_0/3w$ and $M_\Delta = Z_0/2w$ the following relation can
be obtained:

$$\frac{b_F}{b_\Delta} = 0.75 + \frac{0.4387}{\log_2 M_F} \tag{24}$$

From (24) it can be deduced, after trivial computation, that for
$M_F > 3.3$ the number of chips required by an F network is smaller than
that required by a duplicated delta network. Taking account of the relation
$Z_0 = 3M_F w = 2M_\Delta w$, it will be seen that $b_F$ and $b_\Delta$
decrease by increasing the values of $M_F$ and $M_\Delta$. Hence, the
minimal chip count is achieved by making $M_F$ (or $M_\Delta$) as large as
possible; in general, such a condition leads to $M_F \gg 3.3$. Thus, the
chip count for implementing an F-network is less than that required for a
duplicated delta network, although the former has better reliability and
performance.
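
The break-even point quoted above follows from a short calculation, which
the sketch below reproduces numerically. It reads (20) and (21) as
(B/w)(N/M) log_M N modules for the F-network and twice that expression for
the pair of delta networks, and uses the pin relation to set the delta
module size to 1.5 times the SUBF size; these readings and the function
names are assumptions made for the illustration. Non-integer module sizes
are used only to locate the crossing point.

```python
from math import log2


def modules_f(N: float, M: float, B: float = 1, w: float = 1) -> float:
    """Number of SUBF(M) modules: (B/w) (N/M) log_M N."""
    return (B / w) * (N / M) * (log2(N) / log2(M))


def modules_two_deltas(N: float, M: float, B: float = 1, w: float = 1) -> float:
    """Number of M x M modules for two delta networks in parallel."""
    return 2 * (B / w) * (N / M) * (log2(N) / log2(M))


def chip_ratio(M_f: float) -> float:
    """b_F / b_Delta when both modules use all Z_0 pins (M_Delta = 1.5 M_F)."""
    return modules_f(1024, M_f) / modules_two_deltas(1024, 1.5 * M_f)


break_even = 1.5 ** 3   # log2(M_F) = 3 log2(1.5) makes the ratio equal to 1

print(modules_f(16, 4))                     # the 16x16 network of Fig.8
for M_f in (2, 4, 8, 16):
    print(M_f, round(chip_ratio(M_f), 4), round(0.75 + 0.4387 / log2(M_f), 4))
print(break_even, chip_ratio(break_even))
```

The ratio does not depend on N, B or w, and it drops below 1 as soon as the
SUBF module has more than about 3.4 inputs, in agreement with the threshold
stated in the text.
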
6. Conclusions 


Let us now compare the results presented in the 
i 


sted at the beginning of this paper. The analy- 


previsions sections with the four criteria 


tical models presented in section 4 have shown 
that the mean lifetime of an F network is so high, 
that a similar result cannot be achieved by using 
several networks in parallel, where each network 
does not have intrinsic fault~tolerance properties. 
Hence F-network performs very well under crite- 
¥20n: 1s | 

The F-network achieves its high level of
reliability by introducing multiple redundant paths
between each input-output pair. The routing
algorithm is only O(log N), since the F-network is
one of the so-called "digit-controlled" networks
[4]; hence it allows the control functions to be
distributed among several units. Furthermore, the
selection of alternative paths may be performed
on-line, as required by criterion 2, so that simple
recovery procedures are allowed and rerouting can
be used to enhance the system performance.

The F-network is the only network presented in the
literature which holds all the rerouting properties
when broadcast communications are considered, even
if sophisticated broadcasting techniques are used.
Furthermore, since the F-network is a superset of
the multistage cube network, it is also able to
perform all the other switching functions of the
most popular networks, as required by criterion 3.
Finally, the number of active devices required by
an F-network is about equivalent to that required
by a redundant network composed of two delta
networks in parallel. Moreover, if the chip count
rather than the number of active devices is
considered as a cost function, the F-network becomes
cheaper than two delta networks in parallel,
although the former configuration has better
reliability, performance and switching capabilities
than the latter.

ABSTRACT
A method for constructing a fault tolerant
interconnection network is described. It uses
error correcting codes to correct errors due to
all single and many multiple failures of both
switching elements and links, and requires O(Nw)
encoders and decoders, where N is the network size
and w is the size of the packet in bits, and less
than w additional check bits. Also discussed is a
method for isolating the failed component after
one is detected. This result contrasts with
previous results because it allows the network to
continue operation while the fault is being
located, rather than performing off-line testing.


Introduction 
Much research has been done in the area of
interconnection networks in multiprocessor
systems, but not enough attention has been paid to
making such networks fault tolerant. This paper
addresses that issue and applies an old technique
to the problem.

This fault tolerant network design attempts
to provide a cost effective method that allows the
network to operate properly in the presence of
some set of failures, including all single
failures, while maintaining a straightforward
routing algorithm.

Previous approaches have attempted to solve 
the problem by providing alternate paths from each 
source to each destination, allowing a failed 
component to be bypassed. Falavarjani and Pradhan 
[FaPr81] and Adams and Siegel [AdSi82] propose 
using a standard network with an extra stage. 
With this extra stage, there is more than one path 
to each destination. Shen and Hayes [ShHa80] use 
a simple fault model, and explore the fault 
tolerant capabilities of conventional 
configurations. Yew [YewP81] proposes using
multiple layer networks, which are by default 
fault tolerant, since a failure in any particular 
layer could be avoided by deactivating that layer. 

These results have several limitations.
First of all, many authors assume an unreasonably
optimistic fault model, such as that a switching
element will only fail by becoming stuck at one of
its valid states. Second, many fault models do
not include failure of a data link, and all faults
are assumed to be permanent and not transitory.
Third is the extra complexity added to the routing
algorithm. In order to avoid a faulty section of
the network, the location of the fault must be
known by the sender prior to transmitting the
packet. Finally, several passes through the
network may be required to pass a permutation when
a failure occurs.

The approach here is to apply the techniques
of error coding to correct errors caused by a
single fault, and to analyze the failure data to
locate the faulty component. The use of error 
correction codes in communication systems is not 
new [PeWe72]. Pradhan and Stiffler [PrSt80] 
discuss using error codes to achieve fault 
tolerance in many computer applications, such as 
ALU and memory design, but they do not mention 
interconnection networks specifically. This paper 
discusses using such codes to achieve a fault 
tolerant network design. 


Definitions


The network to be considered here is the 
omega network of size $N=2^n$ for packets of w bits
routed in parallel, although the technique is 
applicable to other networks as well. The packets 
can be buffered in a switching element to allow 
improved performance [CLYP81] [Chen82], and the 
network can be operated in one of two modes, SIMD 
or MIMD. 

The network described above can be viewed in
three dimensions. The x dimension is the
direction of packet flow. The y dimension is the
different processors in the system. The z
dimension is the parallel bits of the packets sent
by the processors. One conventionally views a
network in the x-y plane.

A packet consists of many bits, but pin 
limitations in VLSI technology require that only s 
bits per package be allowed. Consequently, at 
least w/s packages are required to implement a 
single switching element. The fault tolerance
scheme presented here will take advantage of the
multiple package requirement.

A virtual switching element is defined as a 
single column of physical switching elements and 
its associated control. This functions as a 
single w bit switching element. Similarly, a 
virtual link is the column of wires that implements
a w bit link. Finally, a physical switching plane
is defined as a single x-y plane of physical 
switching elements and their associated links, 
which with the control implement an s bit layer of 
the NxN network. 

The fault model is general and will encompass 
almost any single and some multiple failures of 
real components. It is assumed that any single 
interconnection link can fail, independently of 
any other link. An example of a link failure is 
one that is permanently stuck at logic level 0 or 
1, or some level outside the normal logic domain. 
It is also assumed that any single physical
switching element package can fail independently
of all of the others, and that such a failure 
causes some or all of the outputs of that 
switching element to be invalid. Link and 
switching element failures can be either permanent 
or transitory. 


Fault Tolerance by Coding Each Packet 


The principle of this design is to take
advantage of the package redundancy caused by the 
large word size, and use an error code to correct 
errors caused by single failures. This requires 
great care, since faults in the network control 
could potentially route a packet to the wrong 
destination. 

As mentioned previously, several physical 
switching elements are required to implement one 
virtual switching element. As currently 
described, however, a single fault in the output 
of the control section could cause all of the 
physical switching elements in the virtual switch 
to route the outputs to the wrong port, which is 
undesirable. If the control section generated 
three control signals independently and the 
physical switching elements contained voting logic 
for these signals, then a single fault in the 
control section would not cause any of the 
switching elements to misroute the data. Hence
the worst case failure would be a switching
element failure. If an error correcting code were
used to correct the error caused by a single 
switching element failure, then the fault would 
not cause system failure. Furthermore, the same 
code can be used to correct errors from link 
failures. 
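
The voting idea above can be illustrated with a small sketch. It is not taken from the paper: it assumes a 2x2 exchange element with a one-bit control signal, and the names vote and route are hypothetical.

```python
def vote(a, b, c):
    """2-of-3 majority of three independently generated one-bit control signals."""
    return (a & b) | (a & c) | (b & c)

def route(control_copies, upper, lower):
    """Assumed 2x2 exchange element: pass straight through when the voted
    control bit is 0, swap the two inputs when it is 1."""
    ctl = vote(*control_copies)
    return (lower, upper) if ctl else (upper, lower)

if __name__ == "__main__":
    # One corrupted copy of the control signal does not change the routing.
    print(route((1, 1, 1), "A", "B"))  # intended setting: swap
    print(route((1, 0, 1), "A", "B"))  # one faulty copy, same result
```

With this arrangement a single faulty control copy is outvoted, so each physical switching element still takes the intended setting.
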

The previous discussion said little about the 
error correction code required, and did not 
address the problem of corruption of routing bits. 
When a switch failure occurs, then a contiguous 
run of s bits aligned on an s bit boundary could 
be in error. Any code used would have to be able 
to correct for such a failure. Furthermore, the 
destination tag portion of the packet must be 
corrected in the control section of every stage, 
so if an error destroys a bit in the tag, the 
packet will still be routed properly. 

Although codes to correct large bursts of 
errors exist [PeWe72], they are cumbersome to use, 
particularly in the coding and correcting
operations. An alternative is to use a single
error correction code on pieces of the data, and
distribute the bits of the code words to different
physical switching elements. Then a switching
element package failure would only destroy single 
bits of several code words, all of which can be 
corrected. 

One could use a $(2^m-1,\ 2^m-m-1)$ Hamming code
[PeWe72] for such a purpose. The bits of data are
divided into parcels of $2^m-m-1$ bits each. These
are coded in parallel, resulting in $k=w/(2^m-m-1)$
groups of $2^m-1$ bit code words, each group
containing $2^m-m-1$ data bits and m parity bits.
Each bit of a code word is sent to a different
switching element. One possible assignment is to
route bit i of group j to switching element
$(i+j-2) \bmod (2^m-1) + 1$. The outputs of the
switches at the final stage are unscrambled, and
the error correction procedure is applied.
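
The bit distribution can be sketched for the smallest useful case, m = 3, which gives a (7,4) Hamming code. This is a minimal illustration under assumptions that are not in the paper: a failed package is modelled as inverting every bit it carries, and the helper names are hypothetical.

```python
N_BITS = 7  # code word length 2**m - 1 for m = 3

def encode(d):
    """Hamming (7,4): data bits at positions 3, 5, 6, 7; parity at 1, 2, 4."""
    c = [0] * 8  # index 0 unused so that list indices equal bit positions
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c[1:]

def decode(word):
    """Correct at most one flipped bit and return the four data bits."""
    c = [0] + list(word)
    syndrome = 0
    for pos in range(1, N_BITS + 1):
        if c[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome:
        c[syndrome] ^= 1     # the syndrome is the position of the bad bit
    return [c[3], c[5], c[6], c[7]]

def plane_of(i, j):
    """Switching element that carries bit i of group j (both 1-based)."""
    return (i + j - 2) % N_BITS + 1

def send(groups, failed_plane=None):
    """Encode each group, corrupt the bits carried by one failed element,
    then decode at the destination."""
    received = []
    for j, data in enumerate(groups, start=1):
        word = encode(data)
        for i in range(1, N_BITS + 1):
            if plane_of(i, j) == failed_plane:
                word[i - 1] ^= 1  # assumed failure mode: inverted output
        received.append(decode(word))
    return received
```

For a fixed group j the mapping from bit i to switching element is one-to-one, so a single failed element damages exactly one bit of every code word, which the code can correct.
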

It was initially stated that the scheme was 
to tolerate all single faults in the network.
Additionally, the errors from several common 
multiple failures can be corrected, so it is 
desirable to classify those multiple failures that 
do not disrupt correct network operation. The 
only multiple failure that would prevent 
successful network operation is one that destroys 
more than a single bit of a code word. A single 
physical switching plane can fail totally and the 
network will continue to operate. Thus if a 
failure occurs and the physical switching plane or 
some portion of it is implemented on a single 
card, then the failed section could be removed and 
replaced while the network continues to operate. 
Only multiple failures in the same virtual 
switching element or in different physical switch 
planes but common to at least one path would cause 
incorrect operation. Similarly, multiple link 
failures are allowed if they do not happen to 
destroy two bits of the same code word. 

The network can be made more resilient to 
error by applying the correction to the code words 
between every stage. Then the input to each stage 
is known to be correct, and the network can 
tolerate single faults in each virtual switching 
element. This change would require $\log_2 N$ times as
many decoders as the original scheme. 

The main disadvantage of this technique is
the large increase in the number of bits, which is
$m/(2^m-m-1)$. The number of encoders and decoders
required is $Nw/(2^m-m-1)$. The limitation on the
maximum size code is the packet size (including
routing bits) and the number of bits in a physical
switching element. The total packet size must be
at least $s(2^m-m-1)$ bits, or some of the bits in
the switching elements will not be utilized.

The best code for arbitrary w can be
selected by choosing the smallest value of m such
that $2^m-m-1 \ge w/s$. Then the number of parity bits
required is ms, and the parity bit overhead is
ms/w. Since $m \approx \log_2(w/s)$, the overhead is
O(s log(w/s)/w), plus O(Nw) encoders and decoders.
This is preferable to triple modular redundancy.
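
The selection rule can be tried on concrete numbers. The sketch below follows the rule as stated (smallest m with 2^m - m - 1 >= w/s); the packet and package sizes in the example are assumptions chosen for illustration, not values from the paper.

```python
def choose_code(w, s):
    """Return (m, parity_bits, overhead) for packet width w and s bits per package.

    m is the smallest value with 2**m - m - 1 >= w / s; the scheme then
    needs m * s parity bits, an overhead of m * s / w.
    """
    m = 2
    while 2 ** m - m - 1 < w / s:
        m += 1
    parity_bits = m * s
    return m, parity_bits, parity_bits / w

if __name__ == "__main__":
    # Hypothetical example: 64-bit packets, 4 bits per package.
    print(choose_code(64, 4))
```

For comparison, triple modular redundancy carries two extra copies of every bit, an overhead of 2.0 on the same scale.
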


Fault Location 


This section describes a way to analyze the 
failure data to locate the failures. The
technique is simple and can be performed by an 
auxiliary processor while the network continues to 
operate normally. This contrasts with the 
techniques proposed by Feng and Wu [FeWu81], which 
require the network to cease normal operation 
while a series of tests are used to locate the 
failure. 

The error correction codes described in the 
previous section determine which bits have been 
corrupted, and thus isolate the fault to a 
particular physical switching plane. If a single 
physical switching element fails, then as many as 
s bits are incorrect, whereas a link failure will 
only destroy a single bit. 

The correction circuits for each data word
cooperate in locating a fault. If an
error is corrected, then the source and
destination tags are sent to a fault location
processor. The fault location processor
accumulates failure information over several
cycles of network operation. If all corrections
are to the same single bit of the data words, then
a link error is suspected. Otherwise, a switching
element is suspect. Depending on the type of
error, the fault location processor determines at
which switching element or link the paths of the
failed packets intersect.

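Determining the intersection of the paths can
be accomplished by comparing the source and
destination tags of the corrected packets. Lawrie
showed [Lawr75] that the path used by a packet is
uniquely determined by its source and destination
tags. With source tag $S_1 S_2 \ldots S_n$ and
destination tag $D_1 D_2 \ldots D_n$, a packet travels
through switching element

$$
\begin{cases}
S_2 \ldots S_n & i = 1 \\
S_{i+1} \ldots S_n D_1 \ldots D_{i-1} & 2 \le i \le n-1 \\
D_1 \ldots D_{n-1} & i = n
\end{cases}
$$

in stage i.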

The intersection of two paths can be
determined by comparing the concatenated source
and destination tags of the paths, and noting the
common bits. The absence of a continuous run of
identical bits (not including $S_1$ and $D_n$) indicates
the paths do not intersect at any switching
element. The paths from 100 to 110 and 110 to 000
do not intersect at any switching element. On the
other hand, the paths from 001 to 011 and 111 to
001 do intersect, because $S_3 D_1$ is 10 in both
pairs. Since it matches two bits from the left,
the paths intersect at the second stage and
switching element 10. The technique is similar
for locating link failures, except a link requires
n consecutive bits.
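
The tag comparison can be sketched as a sliding window over the concatenated source and destination tags. The code is an illustration, not the paper's implementation: tags are assumed to be bit strings, the element used at stage i is taken to be the (n-1)-bit window starting after the first i bits, and the link leaving stage i is assumed to be the n-bit window at the same offset.

```python
def element_at(src, dst, stage, n):
    """(n-1)-bit label of the switching element used at `stage` (1-based)."""
    tag = src + dst  # concatenated tag S1..Sn D1..Dn
    return tag[stage:stage + n - 1]

def link_after(src, dst, stage, n):
    """n-bit label of the link leaving `stage` (assumed indexing)."""
    tag = src + dst
    return tag[stage:stage + n]

def common_elements(p, q, n):
    """Stages and element labels where the paths p and q meet.

    p and q are (source, destination) pairs of n-bit strings.
    """
    hits = []
    for stage in range(1, n + 1):
        label = element_at(p[0], p[1], stage, n)
        if label == element_at(q[0], q[1], stage, n):
            hits.append((stage, label))
    return hits

if __name__ == "__main__":
    print(common_elements(("100", "110"), ("110", "000"), 3))
    print(common_elements(("001", "011"), ("111", "001"), 3))
```

On the two examples in the text the first pair shares no element and the second pair meets at stage 2, element 10. A fault location processor could apply the same comparison to every pair of corrected packets it has accumulated.
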


In the presence of multiple faults, fault
isolation can be difficult. If the network fails
due to multiple faults then on-line fault
isolation is impossible. If the multiple faults
allow continued operation, then the data could be
analyzed as before. This will result in a list of 
suspect locations, but they will not be completely 
accurate. If two packets go through two separate 
failed components in early stages of the network 
and happen to go through a common switching 
element, then that element could be flagged as 
faulty. For such cases off-line testing as 
described in [FeWu81] would be required. 


Conclusion 


This paper has presented a method for 
achieving a fault tolerant interconnection design 
using error correcting codes. The technique 
utilizes the multiple packages required to 
implement many parallel bits in the packet. The 
packets are encoded using several Hamming codes, 
and the bits of the code are distributed to 
different physical switching elements. At the 
destinations the packets are corrected for errors. 
The error caused by any single failure, and a 
large class of multiple failures, can be corrected 
by the codes. The network requires O(Nw) extra 
hardware in the form of packet encoders and 
decoders, where N is the network size and w is the 
packet size. Also, additional hardware is
required in the form of extra bits to be
transmitted, although that is dependent on the
particular Hamming code chosen.

Also presented is a technique for comparing
the source and destination tag bits of the packets
requiring correction to isolate the fault in the
specific switching element or link, while the
network continues normal operation. This is
different from previous fault location techniques,
which require removing the network from the system
for special testing.

There are two problems with this technique.
First of all, it presumes the existence of fault
tolerant encoders and decoders for the error
correction code. Secondly, the scheme requires
many bits to be routed in parallel. For a bit
serial method of transmission some other method
for achieving fault tolerance would be required.

Abstract. ESL Incorporated is presently
developing a high speed data flow computer desig- 
nated the Data Driven Signal Processor (DDSP). 
Intended primarily for signal processing applica- 
tions, DDSP is designed to be programmable and 
modular. The use of data flow architecture pro- 
vides a natural way of expressing parallelism in 
algorithms; DDSP maps this parallelism onto a mul- 
tiprocessing system that can be expanded without 
software modification. The maximum configuration 
of 32 processors occupies four chassis and has an 
execution rate of about 71 MFLOPS. Hardware and 
high order language designs were coordinated
resulting in a compiler that generates extremely 
efficient code. 


1.0 INTRODUCTION 
The Data Driven Signal Processor (DDSP) is 
being developed by ESL Incorporated to meet
requirements for a digital signal processor that
is cost effective, programmable, modular, and eas- 
ily interfaced with other digital hardware. Data 
flow techniques are used because they provide an 
effective method for programming algorithms in a 
multiprocessor environment; data flow exposes the 
fine grain parallelism in algorithms without 
explicit software directives. This parallelism 
can then be spread out over several processors to 
increase overall performance. Data flow is also a 
natural way of expressing signal processing prob- 
lems, which engineers typically represent using 
data flow graphs. 

DDSP has been designed for ease of programming 
with a high order language capable of generating 
efficient machine code; it is modular, with a 
variety of possible configurations ranging from 
one to 32 processors; it is fast, with a full con- 
figuration operating at about 71 million floating 
point operations per second (MFLOPS); and it 
interfaces with a variety of devices allowing for 
concurrent data and I/O processing. 

Currently, the best way of getting low cost 
computing power is to use array processors such as 
the Floating Point Systems AP-120B. Over the past 
few years, array processors have permitted the 
solution to a wide class of vector oriented prob- 
lems, with a cost effectiveness unobtainable with 
conventional computers. However, this cost effec- 
tiveness can be lost when new software is 
required. Array processors are usually programmed 
using horizontal microcode that is difficult to
write and maintain. The need for a high order 
language is borne out by a measure of programmer
productivity developed by R. Bucy and K. Senne,
and reprinted in (12). Bucy and Senne measured
the number of man-months to program a
nonlinear filtering problem on several computers. 
Their results indicated that programming took 0.5 
man-months using Fortran on the VAX-11/780, as
compared to 6.0 man-months for microprogramming 
the Floating Point Systems AP-120B. This factor
of 12 is typical of the microprogramming
experience at ESL using other array processors.
Although array processor hardware is inexpensive,
the high software costs can greatly reduce their
effectiveness. Floating Point Systems has tried
to alleviate high software costs by developing a
version of Fortran called AP-Fortran. This
approach is effective in improving programmer
productivity but, for one benchmark (12), it
generated code that was four times slower than a
hand-coded version. The Data Driven Programming
Language (DDPL), developed for DDSP, mitigates
these problems by providing a high order language
that is capable of generating efficient machine
code. This has been achieved because of the
inherent expressive power of data flow techniques,
and because hardware and language designs were
closely coordinated from the earliest conceptual
stages.

Data flow computers operate on an entirely
different principle than conventional von Neumann
computers. A von Neumann computer executes
instructions one at a time (or sequentially),
whereas a data flow computer executes nodes when
the data for those nodes becomes available. Nodes
are like the instructions used in von Neumann
computers, except that they can perform more than
one operation. These nodes are logically connected
by arcs which are used as pathways between nodes.
Arcs are used for sending tokens that contain the
values used in computations. Arcs are
one-directional pathways that have a source and a
destination and are connected from the output
port of one node to the input port of another. A
node is activated when all required tokens have
arrived at input ports and is executed when a
processor is available. When a node executes, the
input tokens to the node are consumed. A data
flow program follows a single assignment rule
which states that a token can only have one
destination. This rule permits the orderly
allocation and deallocation of values as they are
created by one node and subsequently consumed by
another, and it exposes the parallelism in
algorithms.
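
The firing rule described above can be sketched in a few lines. This is a generic illustration of data flow execution, not DDSP's implementation: the Node class, the ready list and the single-destination arc are all assumptions made for the example, and token labels are omitted.

```python
class Node:
    """A data flow node with a fixed number of input ports and one output arc."""

    def __init__(self, op, n_inputs, dest=None):
        self.op = op                      # function applied when the node fires
        self.inputs = [None] * n_inputs   # one slot per input port
        self.dest = dest                  # (node, port) of the single destination

    def receive(self, port, value, ready):
        """Store a token; the node becomes activated once every port holds one."""
        self.inputs[port] = value
        if all(v is not None for v in self.inputs):
            ready.append(self)

def run(ready):
    """Execute activated nodes until none remain; return tokens with no destination."""
    results = []
    while ready:
        node = ready.pop(0)
        # Executing a node consumes its input tokens.
        args, node.inputs = node.inputs, [None] * len(node.inputs)
        value = node.op(*args)
        if node.dest is None:
            results.append(value)
        else:
            target, port = node.dest
            target.receive(port, value, ready)
    return results

def demo():
    """Evaluate (2 + 3) * (7 - 4) as a three-node graph."""
    mul = Node(lambda a, b: a * b, 2)
    add = Node(lambda a, b: a + b, 2, dest=(mul, 0))
    sub = Node(lambda a, b: a - b, 2, dest=(mul, 1))
    ready = []
    add.receive(0, 2, ready)
    add.receive(1, 3, ready)
    sub.receive(0, 7, ready)
    sub.receive(1, 4, ready)
    return run(ready)
```

In demo() the add and sub nodes are activated independently of each other, so a machine with two free processors could execute them at the same time; no explicit directive is needed to expose that parallelism.
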

The current work in data flow computers has 
concentrated on experimental designs that are 
applicable to the broad range of computational 
problems. Our approach has been to look at a
specific class of problems (i.e. digital signal
processing), and to use data flow techniques that
result in a cost effective processor as compared
to current methodology (i.e. using array
processors). An overview of data flow concepts is
beyond the scope of this paper; however, good
introductions are given by T. Agerwala and Arvind (7), and
J. B. Dennis (8). Also of importance is a survey 
of data flow languages by W. B. Ackerman (9). Our 
design has been greatly influenced by the work of 
J. R. Gurd and I. Watson (1,2,3,4), and Arvind and 
K. Gostelow (5,6), especially in regards to the 
concept of dynamic tagged data flow. In DDSP a 
label field is appended to the data tokens in 
order to distinguish between different instances 
of the same token. A matching store resembling 
that proposed by Gurd and Watson is used to 
match-up pairs of tokens with the same label. The 
matching process is implemented using a hash algo- 
rithm similar to that described by T. Ida and E. 
Goto (11). The work by J. B. Dennis and K-S. Weng
(10) on the use of data streams in data flow
computers has also been influential to the DDSP
design, because of the similarities between data
streams and time series data used in digital
signal processing.

The next section establishes some of the basic 
characteristics of DDSP including how it compares 
with other data flow processors and with array 
processors. Section 3.0 covers DDSP architecture 
with descriptions of matching store, the process- 
ing element and the interconnection network. An 
introduction to the Data Driven Programming Lan- 
guage (DDPL) is given in Section 4.0 with 
explanations on token labeling, the skew algorithm 
and a special data structure used for communi- 
cation. Finally, Section 5.0 gives the results of 
a discrete time simulation for two signal process- 
ing benchmarks, Section 6.0 presents an 
application of DDSP to sonar signal processing, 
and Section 7.0 reports on the current status of 
DDSP development. 


2.0 DDSP CHARACTERISTICS 


DDSP systems can be configured with one pro- 
cessor or expanded to a system having up to 32 
processors without software modification. DDSP 
can meet a wide range of performance requirements 
starting with a single processor operating at 2.22
MFLOPS, and extending up to a 32 processor system 
operating at 71 MFLOPS. A single DDSP processor 
is packaged on two large printed circuit cards. 
Up to 8 processors can be packaged in a chassis 
along with a bus controller, diagnostic hardware, 
and one or more I/O controllers. The maximum sys- 
tem configuration of 32 processors is packaged in 
four chassis. As a performance benchmark, today's 
crop of array processors operate at 5 to 12 MFLOPS 
and can be matched by DDSP systems with 3 to 6 
processors. Large DDSP systems exceed the proc- 
essing capability of the CDC STAR-100 and CRAY-1, 
two of the world's fastest supercomputers. 

Several unique characteristics of DDSP set it 
apart from other data flow computers. These 
include: 


• a skew algorithm for routing data among
  processors

• a special data structure called the data
  driven communication (DDC) structure used for
  transmitting data between procedures

• generalized labels for multidimensional
  indexing.


The skew algorithm makes use of a token's
label field to direct the token to a specific pro- 
cessor. Using this algorithm, it is possible to 
have a uniform distribution of processing for a 
wide class of array and signal processing 
problems. For many of these problems, the skew 
algorithm has the added benefit of localizing com- 
munication to nearby processors. 

In most data flow programming languages, pro- 
cedures are called with parameter lists similar to 
those used in conventional programming languages 
(13,14). For DDSP we have developed a more flexi- 
ble approach using a type of linked-list that we 
call a data driven communication (DDC) structure. 
A procedure is called simply by passing the proce- 
dure a pointer to the DDC structure. The 
structure not only contains data used in the com- 
putations, but also control information such as 
array dimensions and return pointers. 

Generalized labels are used in order to give 
DDSP a powerful method of indexing multidimen- 
sional data. All nodes are assumed to operate on 
arrays of data where dimensions are determined at 
the point where the corresponding tokens are gen- 
erated. 

DDSP is being designed as an alternative to 
array processors, by providing low cost computing 
power with much more system flexibility. The fol- 
lowing points demonstrate this flexibility: 


• Multi-tasking system. Any number of tasks can
  proceed in parallel. The hardware changes
  contexts (switches from one task to another)
  without any processor overhead.

• Automatic data management. Data storage is
  allocated when a data value is calculated and
  is deallocated when the data value is used by
  subsequent computations. An associative memory
  allows direct access into a data array even
  if some elements of the array have not been
  allocated.

• Continuous data streams. This feature allows
  digital filtering operations to proceed
  without the usual overhead associated with
  indexing across block boundaries.

• Data dependent branching. Unlike array
  processors, DDSP handles data dependent branching
  with the ease of conventional computers.

• Data feedback. Infinite impulse response
  (IIR) filters, adaptive filters, and phase
  lock loops can be implemented using normal
  programming techniques. The time required for
  feedback can be utilized by other tasks that
  are pending execution. By contrast, array
  processors have problems with feedback, often
  resulting in an underutilization of processing
  resources.

• Macro compiler. Programmers can develop their
  own libraries of often used functions. In
  addition, system macros are available for
  standard functions such as I/O and control.


3.0 DDSP ARCHITECTURE 


The need for a programmable, high speed signal
processor was the primary reason for choosing a
data flow architecture for DDSP. Although digital
signal processing involves a high proportion of
vector operations, there are enough scalar,
conditional and other "non-regular" operations to
give data flow architecture a decided advantage
over pipelined approaches used in many array
processors. DDSP architecture is similar to that
developed by Gurd and Watson at the University of
Manchester, whose primary motivation has been to
build an experimental machine for data flow
research. As a result, instruction times for
their machine have been kept relatively slow so
the operation of the processor can be easily
redefined.

DDSP implements a dynamic tagged data flow
model where tokens are tagged with a label field
determined at run-time. A DDSP system consists of
several processors that are closely coupled through
an interconnection network. As shown in Figure 1,
a processor includes an input queue for
temporarily saving tokens, a matching store for
associating pairs of tokens, and a processing
element for performing high speed integer and
floating point computations. The processing
element receives a continuous stream of token pairs
for typical applications by designing matching
store to equal or exceed the speed of the
processing element.

Figure 1. DDSP processor block diagram

There is an important trade-off in designing a
multiprocessing system in regard to the speed of
an individual processor vs. the number of parallel
processors in a system. Some data flow
researchers (8) have indicated that thousands of
relatively slow processors can be connected in a
data flow system using serial I/O. We feel,
however, that there is a point of diminishing return
in regards to finding problems with enough
parallelism to fully utilize such a large collection of
processors. As a result we have designed DDSP
with a relatively fast processor that can perform
floating point operations in about 450 ns (2.22
MFLOPS). Because of the high speed, it requires a
relatively small number of processors to equal the
speed of today's supercomputers, and for a given
level of throughput, less parallelism in the
problem is required to keep all of DDSP's processors
fully utilized.
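
The throughput figures quoted for DDSP follow from the 450 ns operation time. The short sketch below reproduces the arithmetic; it assumes ideal linear scaling with the number of processors, which is a simplification and not a claim from the paper.

```python
import math

FLOP_TIME_NS = 450  # time for one floating point operation, from the text

def mflops(processors, flop_time_ns=FLOP_TIME_NS):
    """Peak rate in MFLOPS, assuming every processor is kept fully busy."""
    return processors * 1e3 / flop_time_ns

def processors_needed(target_mflops, flop_time_ns=FLOP_TIME_NS):
    """Smallest processor count whose peak rate reaches the target."""
    return math.ceil(target_mflops / mflops(1, flop_time_ns))

if __name__ == "__main__":
    print(round(mflops(1), 2), round(mflops(32), 1))
    print(processors_needed(5), processors_needed(12))
```

Under that assumption one processor gives about 2.22 MFLOPS, 32 processors give about 71 MFLOPS, and matching a 5 to 12 MFLOPS array processor takes 3 to 6 processors.
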

The interconnection network (like the rest of 
DDSP), has been strongly influenced by the nature 
of signal processing computations. Many of these
computations are vector oriented, requiring highly 
localized communication. For the most part, com- 
munication is local to the originating processor 
or to its nearest neighbors. Although long dis- 
tance communication is required, it is usually for 
transmitting input values and summary results; 
these require an order of magnitude less communi- 
cation bandwidth than local communication. 

The architecture of the processor is shown in 
Figure 1. A data token enters the processor from 
the bus to the left of the processor and is placed 
in the input queue. The input queue provides load 
leveling within the processor and temporary stor- 
age when processing large volumes of data. The 
queue is organized as a first-in-first-out buffer
with input from the bus and output to matching 
store. 


3.1 Matching Store 


The matching store is a high speed associative 
memory used to match pairs of tokens having iden- 
tical keys. A key consists of an 11-bit node 
address identifying the node used for token proc- 
essing, and a 16-bit label field, providing the 
token with attributes such as index and iteration 
numbers. When a match is found, the pair of 
tokens and the key are sent to the processing ele- 
ment where the node definition is executed. If a 
match cannot be made then the unmatched token is 
stored in memory until a matching token comes
along. Tokens used in unary operations don't 
require matching and are simply passed through 
matching store. 

Matching store is implemented by using a par- 
allel hash algorithm devised by T. Ida and E. Goto 
(11). The algorithm works on the same principle 
as hash techniques used by compiler designers. 
Ida and Goto's contribution has been the design of 
a high speed hardware algorithm that accesses hash 
tables in parallel and provides a means of delet- 
ing table entries when two tokens are matched. In 
our design, two parallel hash tables are imple- 
mented, each capable of holding 16K tokens. The 
hash algorithm is controlled using a state transi- 
tion sequencer implemented in firmware. A 
matching store operation starts by using a hash 
function to transform the 27-bit key into a 14-bit 
hash address. The contents of the two parallel 
hash tables are checked for a match, and if one 
exists, then the matching token is deleted from 
the hash table and the resulting token pair is 
sent to the processing element. If there isn't a 
match then an attempt is made to store the token 
at the hash address; however, it is possible for 
both table entries to be filled, resulting in table 
overflow requiring special processing. In DDSP 
table overflow conditions are handled entirely 
within the parallel hash hardware. This contrasts 
with Gurd and Watson's approach (4), where an 
independent overflow unit is implemented using a 
relatively slow speed microprocessor. 
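
The matching operation described above can be sketched in software. What follows is a minimal Python illustration, not the DDSP hardware algorithm. Only the 11-bit node address, the 16-bit label, the 14-bit hash address and the two 16K-entry tables come from the text. The XOR-folding hash, the overflow list and every name used (make_key, hash_key, MatchingStore, offer) are assumptions made for illustration.

```python
TABLE_BITS = 14                     # 14-bit hash address -> 16K entries per table
TABLE_SIZE = 1 << TABLE_BITS


def make_key(node_address, label):
    """Build the 27-bit key from an 11-bit node address and a 16-bit label."""
    assert 0 <= node_address < (1 << 11) and 0 <= label < (1 << 16)
    return (node_address << 16) | label


def hash_key(key):
    """Hypothetical hash: fold the 27-bit key into 14 bits with XOR."""
    return (key ^ (key >> TABLE_BITS)) & (TABLE_SIZE - 1)


class MatchingStore:
    def __init__(self):
        # Two tables probed at the same hash address; a slot holds None
        # or the (key, value) pair of an unmatched token.
        self.tables = [[None] * TABLE_SIZE, [None] * TABLE_SIZE]
        self.overflow = []          # assumed stand-in for overflow handling

    def offer(self, key, value):
        """Return (key, waiting_value, value) on a match; otherwise store the token."""
        addr = hash_key(key)
        for table in self.tables:
            slot = table[addr]
            if slot is not None and slot[0] == key:
                table[addr] = None              # delete the matched entry
                return key, slot[1], value
        for i, (k, v) in enumerate(self.overflow):
            if k == key:
                del self.overflow[i]
                return key, v, value
        for table in self.tables:
            if table[addr] is None:
                table[addr] = (key, value)      # wait for the partner token
                return None
        self.overflow.append((key, value))      # both slots filled: overflow
        return None
```

In this sketch a token whose partner has not yet arrived is parked in the first free slot, and a later token with the same key removes it again, which mirrors the store-then-delete behavior described in the text.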

DDSP's matching store has a variable cycle 
time of between 110 and 150 ns; when hash tables 
are about half full, an average of 250 ns are 
required for a matching operation including the 
time for overflow processing. The approach in 
DDSP is to operate matching store with a faster 
cycle time than required by the processing element 
and to allow for a relatively high proportion of 
overflow processing. In this manner, we are able 
to implement matching store with two parallel hash 
tables as compared to the eight tables used by 
Gurd and Watson. 


3.2 Processing Element 


A token pair and the corresponding key are 
input to the processing element when it becomes 
available. The processing element includes a 
microprogram sequencer that controls two process- 
ing units: an arithmetic processor and a label 
processor. For the most part, these units operate 
independently although they can be tied together 
in order to share resources. The arithmetic pro- 
cessor includes an arithmetic logic unit (ALU) and 
a high-speed multiplier used for processing both 
floating point mantissas and integers, an 8-bit 
ALU for floating point exponent processing, and a 
memory unit to store constants and intermediate 
results. The label processor is used for creating 
new labels and performing various index 
operations. 

Integer operations are performed in one micro- 
cycle while floating point operations take 2 
cycles for multiplies and an average of 4.5 cycles 
for adds. Label processing is performed in paral- 
lel with these operations and can usually be com- 
pleted without adding to the overall processing 
time. Additional overhead is usually required for 
testing iteration numbers, resulting in an average 
floating point time of about 450 ns (2.22 
MFLOPS). This same type of overhead results in 
average integer times of about 250 ns (4 MOPS). 
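
The quoted rates follow directly from the average operation times. As a small check (the function name rate_in_millions is an assumption made for illustration), the conversion from nanoseconds per operation to millions of operations per second is:

```python
def rate_in_millions(time_ns):
    """Convert an average time per operation (ns) to millions of operations per second."""
    # 1 operation per time_ns nanoseconds = 1e9 / time_ns per second = 1e3 / time_ns million per second
    return 1e3 / time_ns


print(round(rate_in_millions(450), 2))   # floating point: about 2.22 MFLOPS
print(round(rate_in_millions(250), 2))   # integer: 4.0 MOPS
```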


3.3 Interconnection Network 


Data tokens coming out of the processing ele- 
ment's output queue are output to the intercon- 
nection network shown in Figure 2. The network is 
essentially a linear arrangement of processors 
with wrap-around from the last pair of processors 
to the first pair. This arrangement is augmented 
by a three level tree used for long distance 
communication. The processors are closely coupled 
with a minimum amount of network overhead required 
to pass tokens between processors. At the bottom 
level of the network are column buses (C-buses) 
used for local communication along the base of the 
tree. Tokens can be output onto one of the two 
C-buses on either side of the originating 
processor depending on the token's destination. 
The network is organized like a packet switching 
network, where packets simply consist of a single 
token. Each token has its own network destination 
used in routing itself to any one of the 
processors in the system or to an I/O port. For 
signal processing problems, the vast majority of 
the communication is with local processors; thus, 
column buses are used the majority of the time. 
To support longer moves the basic linear 
arrangement of processors is augmented by a three 
level tree structure with communication between 
levels performed through bidirectional queues. 


Figure 2. DDSP interconnection network 


4.0 DATA DRIVEN PROGRAMMING LANGUAGE (DDPL) 


One of the primary reasons for using the Data 
Driven Signal Processor is the ease of 
programming, as compared to microprogrammed array 
processors such as the Floating Point Systems 
AP-120B. DDSP is programmed using the Data Driven 
Programming Language (DDPL), a high order language 
with syntax modeled after ADA and language 
constructs designed for data flow computing. 

The programmer designs algorithms for DDSP 
using a simple conceptual model of parallel 
processing, without regard to the actual number of 
processors in the system. The hardware 
configuration is specified as compile-time 
parameters, and when the configuration changes, 
the program is simply recompiled with new 
parameters. The program itself remains unaltered, 
allowing for the development of configuration 
independent software. 

A DDPL program has a block structure consist- 
ing of a program block containing one or more pro- 
cedure blocks; a procedure defines a logical group 


of actions that have a common purpose. Procedures 
communicate with one another by sending and 
receiving data driven communication (DDC) struc- 
tures containing both data and control 
information; these structures are in the form of 
linked lists where list members may consist of 


values or sub-structures. Values include data 
used in computations, as well as control informa- 
tion used to invoke procedures and route output 
data. A procedure contains node definitions which 
are the basic units of data flow computation; a 
node definition is similar to a task on conven- 
tional computers, because it can execute 
independently of the other software in the system. 
Node definitions contain all the executable code 
in a DDPL program including assignment, output, 
and nested if-then-else statements. In the Data 
Driven Programming Language, a node definition can 
have from one to four input ports, and may produce 
an unlimited number of output tokens. For nodes 
with three or four inputs, the compiler generates 
equivalent node definitions with the one and two 
inputs supported by hardware. 

As a data flow program executes, the same node 
definition may be activated thousands of times. 
In order to keep track of these various acti- 
vations, a generalized label is appended to the 
tokens; for a node to be activated, all the input 
ports must have tokens with identical labels. The 
programmer defines the label field on a procedure 
by procedure basis by subdividing the label into 
label index fields with programmer defined 
meanings such as "sample number", "filter number", or 
"user identification". When writing assignment 
statements the programmer continually makes refer- 
ences to these index fields in a manner analogous 
to array subscripting. 

A unique skew algorithm is used to map index 
numbers into specific processor destinations in a 
manner that provides effective load balancing 
among processors. This skew algorithm allows the 
programmer to write software without knowledge of 
the specific DDSP configuration. For example, an 
algorithm can be debugged on a DDSP system with 4 
processors, and subsequently used on a 32 process- 
or system without software modification, with the 
skew algorithm automatically spreading the compu- 
tations over the larger set of processors. 

A powerful feature of the Data Driven Program- 
ming Language is the ability to define macros and 
then to expand these macros at compile time using 
macro substitution and conditional compilation. 
The motivation for using macro substitution in 
DDPL is to give the programmer a great deal of 
flexibility and still allow the efficient 
generation of machine code. The programmer can build 
macro libraries for often used operations, using 
the same syntax as DDPL programs. In addition, 
there are system macro libraries that include var- 
ious utility functions for DDC structure creation 
and manipulation, I/O device control, and various 
arithmetic functions. 


4.1 Destinations and Labels 


Tokens are primarily generated within node 
definitions when an output statement or destina- 
tion statement is executed. In addition, when the 
program starts execution, initial token values are 
generated based on information in the token decla- 
rations. No matter which method is used, a token 
must be given a destination and label. At the 
machine level, this requires the specification of 


a network destination, a local destination, and a 
label field. The network destination specifies 
the processor or I/O device where the token is 


being directed. The local destination indicates 
the destination node within that processor togeth- 
er with the specific port into that node. In the 
high order language, the programmer specifies a 
local destination simply by referencing the node 
input by its symbolic name. The label is speci- 
fied in a manner analogous to array subscripting, 
and the network destination is derived without 
programmer intervention by a combination of com- 
pile and run time operations on the label field 
using the skew algorithm. 

The manner in which labels are interpreted is 
based on a generalized label declaration provided 
at the start of each procedure. This declaration 
specifies how the label field is subdivided into 
label index fields, provides a symbolic name for 
each index, and indicates the index range. In 
addition, the declaration specifies how the label 
indices are used to generate network destinations. 
Once a decision is made on label field use, a 
programmer's effort can be concentrated on the 
programming task itself without any further regard 
as to how tokens are routed among processors. 
Generalized labels are more flexible than the 
fixed format labels proposed by Gurd and Watson 
(13) because they can be formatted by the program- 
mer to meet specific application requirements. 

The label declaration can be described using 
the following example declaration: 


label (TIME,COUNT,USER:5,8,3) using (1,-2,0); 


Here the label field which has a total of 16 bits 
is divided into three index fields with symbolic 
names TIME, COUNT and USER. The index fields are 
assigned a specific number of bits: the index 
TIME is assigned the high order 5 bits followed by 
COUNT with 8 bits and USER with 3 bits. The 
indices are considered unsigned magnitudes so 
indexing starts at zero and goes in the positive 
direction. Thus the 8-bit index COUNT has a range 
from 0 to 255 (i.e. 2**8-1). 
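
The field layout in this declaration can be illustrated with a short Python sketch. It assumes only what the text states (TIME in the high order 5 bits, then COUNT with 8 bits, then USER with 3 bits, all unsigned). The names FIELDS, pack_label and unpack_label are hypothetical and are not part of DDPL.

```python
# Field names and widths from the example declaration, high-order field first.
FIELDS = (("TIME", 5), ("COUNT", 8), ("USER", 3))


def pack_label(**indices):
    """Pack unsigned index values into a single 16-bit label."""
    label = 0
    for name, bits in FIELDS:
        value = indices[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} out of range for {bits} bits")
        label = (label << bits) | value
    return label


def unpack_label(label):
    """Split a 16-bit label back into its named index values."""
    out = {}
    for name, bits in reversed(FIELDS):   # low-order field first
        out[name] = label & ((1 << bits) - 1)
        label >>= bits
    return out
```

For example, `pack_label(TIME=1, COUNT=2, USER=3)` places TIME at bit 11 and COUNT at bit 3, and `unpack_label` recovers the three indices from the packed value.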


The DDPL programmer can think of the 
processors as being arranged linearly with the 
last processor connected back to the first 
processor. In this scheme, higher numbered 
processors are to the right and lower numbered 
processors are to the left. As a result, when a 
token generated in the last processor is routed 
one processor to the right, it in fact ends up at 
the first processor. It is because of this 
wrap-around feature that the programmer does not 
have to know specifically how many processors are 
in the object DDSP system. 

A token is routed in the network based on the 
skew algorithm applied to the token label. Since 
the same algorithm is used in all processors, 
tokens with the same label will be routed to the 
same processor no matter where they were 
generated. In the above example, the using phrase 
specifies routing constants corresponding to each 
index. These constants are used by the skew 
algorithm in the following manner: Assuming that a 
node is executing in one of the processors, if the 
TIME index is incremented, then the resulting 
token is routed +1 (i.e., one processor to the 
right). If COUNT is incremented the routing will 
be -2 (i.e., two processors to the left). If USER 
is changed, it will have no effect on routing 
because the corresponding routing constant is 
zero. As another example, if both TIME and COUNT 
are incremented, then the routing is the sum of 
the routings for the individual indices. 

To be more specific, the network destination 
(i.e., processor number) for a new token is computed 
as a function of the token's label indices and the 
procedure's routing constants using the following 
skew algorithm: 


     NDEST = (R1*I1 + R2*I2 + ... + Rn*In) mod NPROC 


where R1, R2, ..., Rn are the routing constants, 
I1, I2, ..., In are the label index values, and 
NPROC is the number of processors in the DDSP 
configuration. The resulting value, NDEST, is the 
logical processor number in the network. The 
compiler is optimized to compute the function at 
compile time if at all possible. As an example of 
how this algorithm is used, suppose the label 
declaration is 


     label (I,J,K,L:2,2,2,10) using (1,-1,1,3) 


and the new label has the index values I = 0, 
J = 1, K = 2 and L = 12. Assume that this program 
is being compiled for a four processor DDSP 
system; then the current logical processor number 
is (1*0 + (-1)*1 + 1*2 + 3*12) mod 4 = 1. 
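
The skew computation can be written out as a short Python sketch. The function name network_destination is an assumption made for illustration, and the index values (0, 1, 2, 12) are the ones that appear in the arithmetic of the worked example. Python's % operator returns a non-negative result for a positive modulus, which matches the wrap-around numbering of processors.

```python
def network_destination(routing_constants, indices, nproc):
    """Skew algorithm: NDEST = (R1*I1 + R2*I2 + ... + Rn*In) mod NPROC."""
    if len(routing_constants) != len(indices):
        raise ValueError("one routing constant is needed per label index")
    total = sum(r * i for r, i in zip(routing_constants, indices))
    return total % nproc


# Worked example from the text: using (1,-1,1,3), four processors.
print(network_destination((1, -1, 1, 3), (0, 1, 2, 12), 4))
# The same label on a 32 processor system, with no change to the program.
print(network_destination((1, -1, 1, 3), (0, 1, 2, 12), 32))
```

The second call shows the configuration independence described earlier: only the NPROC parameter changes when the number of processors changes.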

When a node definition is activated, it has an 
input label field associated with it, referred to 
as the current label, that contains the current 
indices; the processor where the node is activated 
is referred to as the current processor. Within 
the confines of this activation, current index 
values are referred to by their symbolic names as 
specified in the label declaration. These values 
can be used in generating labels for output 
tokens, or used as operands in arithmetic and 
conditional operations. 

The programmer has several ways of specifying 
a destination. The most common way is to use the 
symbolic name for a node input followed by an 
index list. The index list is an ordered list 
that corresponds one-to-one with the index 
identifiers in the label declaration. With 
reference to the above example, a destination 
might have the form: 

     XDATA (TIME + 3, COUNT - 1, USER) 


In this example, the TIME index is incremented by 
+3 and the COUNT by -1 relative to the current 
indices. XDATA is the symbolic name for the 
destination and, based on the routing constants 
above, the destination is 3*1 + (-1)*(-2) = +5 
(i.e., five processors to the right). 
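
Because the skew function is linear in the indices, a change in the indices translates into a fixed displacement from the current processor. The following Python sketch illustrates this for the XDATA example. The names routing_offset and destination_processor are hypothetical, and the four processor configuration in the usage lines is an assumed value.

```python
def routing_offset(routing_constants, index_deltas):
    """Displacement (in processors) caused by changing each index by its delta."""
    return sum(r * d for r, d in zip(routing_constants, index_deltas))


def destination_processor(current, offset, nproc):
    """Apply a displacement with wrap-around from the last processor to the first."""
    return (current + offset) % nproc


# XDATA (TIME + 3, COUNT - 1, USER) with using (1,-2,0)
offset = routing_offset((1, -2, 0), (3, -1, 0))
print(offset)                                 # displacement to the right
print(destination_processor(1, offset, 4))    # assumed: current processor 1 of 4
```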


4.2 Data Driven Communication (DDC) Structures 


The primary means for communicating between 
procedures is by sending and receiving special 
data structures called data driven communication 
(DDC) structures. DDC structures are used in DDPL 
(rather than the parameter lists common to most 
high order languages) because of their 
versatility. For example: 


o DDC structures can be created without specific 
  knowledge about the indexing scheme used by 
  the receiving procedure. 

o The sending procedure doesn't have to have a 
  complete set of data before it starts to send 
  a structure. 

o DDC structures can be created with parallel 
  sub-structures. These sub-structures can be 
  sent independently of each other, thus 
  creating parallel transmission paths between 
  procedures. 

o DDC structures can be combined and separated 
  simply by manipulating the pointers to these 
  structures. 

o The actual movement of data values at the base 
  of a structure only occurs on the basis of 
  data availability on the part of the sending 
  procedure, and data demand on the part of the 
  receiving procedure. This two-way data 
  control allows for an orderly flow of data 
  between procedures. 

o DDC structures may include not only the data 
  to be processed but control information such 
  as the array dimensions, the type of data in 
  the structure (integer or floating point), or 
  where the results should be sent. The 
  structure can also specify which procedure 
  should be used in processing the data. 

o DDC structures can be used to time-share a 
  procedure among several calling routines. 
  Independent DDC structures are used by each of 
  the callers so that they can easily be 
  identified within the receiving procedure. 


The programmer creates and manipulates DDC 
structures by using a set of system macros. The 
macros generate node definitions that do the actu- 
al data manipulation. In Figure 3, an example is 
given on how DDC structures can be used for matrix 
multiplication. A matrix multiply procedure is 
sent a structure consisting of two sub-structures 
representing the matrices to be multiplied. The 
dimensions of the matrices are included as part of 
the sub-structures, and the columns are separated 
into their own sub-structures so that they can be 
transmitted in parallel. The matrix multiply 
procedure receives the DDC structure, performs the 
computations, and creates another structure as the 
final result. 

The actual manner in which a DDC structure is 
transmitted is illustrated in Figure 4. The send- 
ing procedure outputs a call to the receiving pro- 
cedure in the form of a pointer indicating where 
the structure is being created. The call enters a 
first-in-first-out queue where it stays until the 
receiving procedure has an activation name avail- 
able. The number of activation names for a 
procedure is specified by the programmer and 
determines the number of calls that can execute 
simultaneously. One of the index fields defined 
for the procedure is used to hold the activation 
name. When an activation name is assigned to the 
calling routine, the receiving procedure makes a 
request to the sender which includes a pointer 
indicating where the structure is to be used. The 
sender, in turn, sends an acknowledge in the form 
of a pointer to the next level in the structure. 
This request/acknowledge process continues until 
the bottom levels of the structure are reached. 
At this point, the receiving procedure makes a 
request for data values from the sender, and the 
sender waits for data to become available. The 
data is subsequently sent to the receiving 
procedure and consumed as part of the computations. 


4.3 Node Definitions 

Node definitions contain all the executable 
statements in a DDPL program. These nodes are 
similar to tasks used by conventional operating 
systems: they represent a segment of code that 
can be executed independently of all other 
software in the system. In DDPL a node has a 
relatively small number of input ports (from one 
to four inputs). Because of this, node 
definitions tend to be short and thus result in DDPL 
procedures consisting of many independent nodes. 
Each of these node definitions represents an 
opportunity for parallel activation, and DDSP achieves 
its high speed as a result of spreading these 
activations over several processing elements. 


     label (TIME,COUNT,USER:5,8,3) using (1,-2,0); 
     SYNC: token integer; 
     local integer; 


     node A, B is 
        TEMP := A - B; 
        if TEMP >= 0 then 
           SYNC(0,0,0) := TEMP; 
        else 
           SYNC (TIME+1, COUNT, 5) := COUNT+1; 
        end if; 
     end node; 


Figure 5. Example node definition 


One of the features of DDSP is that there is 
no processor overhead associated with the 
transition from one node activation to the next. As a 
result, the number of node activations has little 
effect on the overall execution time. In fact, 
good DDSP programming practice requires that node 
execution time be kept to a minimum in order to 
allow as many independent activations as possible. 
To achieve this, nodes are kept short and program 
loops are implemented by multiple node 
activations. This involves the output of a 
control token with an incremented index field and 
feeding it back to the node at the start of the 
loop. This process causes independent and 
sometimes parallel activation of the loop iterations. 
It also allows parallel execution of other nodes 
that are not part of the loop. 

Statements that can be used in a node 
definition include output, assignment, destination and 
if-then-else statements. Output statements 
perform basic integer, floating point and logical 
operations, and output tokens to specified 
destinations. Assignment statements perform the same 
types of computations, except the results are 
normally used within the current node activation. 
Destination statements are used in conjunction 
with pointers in order to return results to 
destinations determined at run-time. Availability of 
the if-then-else statement represents one of the 
key advantages of DDSP as compared to array 
processors. The fact that DDSP can perform data 
dependent branching means that different software 
can be used in processing the elements of an 
array, depending on the element values themselves. 
This can be achieved with no loss in processor 
efficiency. 

In Figure 5, an example node definition is 
presented; identifiers are declared at the program 
and procedure levels. The node executes when 
tokens with identical labels are available at 
input ports A and B; TEMP is computed and stored 
as a local variable and the if-then-else statement 
determines which of the two output statements to 
execute. The second output statement uses the 
current index value, COUNT, as an operand. 
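
The behavior of the node in Figure 5 for a single activation can be mimicked in Python. This is only a sketch of the data dependent branching: the function name figure5_node and the representation of an output token as a (destination name, label indices, value) tuple are assumptions, and the matching of input tokens by label is not modeled.

```python
def figure5_node(a, b, time, count, user):
    """One activation of the Figure 5 node for inputs A and B.

    time, count and user are the current label indices; user is part of
    the current label but is not read by this particular node.
    """
    temp = a - b                                   # TEMP := A - B
    if temp >= 0:
        return ("SYNC", (0, 0, 0), temp)           # SYNC(0,0,0) := TEMP
    # SYNC(TIME+1, COUNT, 5) := COUNT+1, using current indices as operands
    return ("SYNC", (time + 1, count, 5), count + 1)


print(figure5_node(7, 3, time=2, count=10, user=1))
print(figure5_node(3, 7, time=2, count=10, user=1))
```

The two calls take different branches, so the label of the output token, and therefore the processor it is routed to, depends on the input data.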


5.0 DDSP SIMULATION RESULTS 


An important part of DDSP development has been 
the implementation of a discrete time simulator. 
The simulator models asynchronous operations 
involving interprocessor and matching 
store-to-processor communication down to the 
register level. Execution of node definitions is 
modeled simply by establishing the elapsed time 
between the start of execution and when tokens are 
output to the interconnection network. The 
simulator has been an indispensable aid in the design 
of DDSP by allowing the design team to perform 
trade-offs of performance vs. hardware complexity. 
It has also provided a means for evaluating DDSP 
performance for some typical signal processing 
applications. Results for two such applications, 
a finite impulse response (FIR) filter and a fast 
Fourier transform (FFT) algorithm, are presented in 
this section. 


Table I. DDSP Simulation Results for FIR Filter 

   Columns: Number of Processors, Execution Rate 
   (MOPS / MFLOPS), Efficiency (%) 

   Execution Rate: 3.80 MOPS, 7.58, 14.98, 29.02, 
   53.12 

   Notes: 

   (1) 32 tap finite impulse response (FIR) 
       filter. 

   (2) 64 samples processed in parallel. 

   (3) MOPS: Million operations per second. 

   (4) MFLOPS: Million floating point 
       operations per second. 

Table I indicates the execution rate and effi- 
ciency of a 32-tap FIR filter. The execution rate 
indicates the average number of arithmetic oper- 
ations being performed, excluding index and book- 
keeping operations; efficiency indicates the 
proportion of time that the processing element is 
performing useful work. The results indicate a 
slight decrease in efficiency as the number of 
processors increases, primarily because the number 
of parallel operations is held at a constant 
level. 

Table II shows similar results for 256 and 
1024 point complex FFTs. In these experiments, 
the efficiency actually increased for larger 
processor configurations, mainly because the number 
of parallel operations was allowed to increase 
along with the number of processors. Results for 
the 1024 point complex FFT indicate that a four 
processor DDSP system (about half a chassis) is 
comparable to a Floating Point Systems AP-120B. 


Table II. DDSP Simulation Results for FFT Algorithm 

   Columns: Complex FFT Size, Number of Processors, 
   Execution Rate (MFLOPS), Efficiency (%) 

   Reference row: 1024 for AP-120B 

   Notes: 

   (1) Floating point arithmetic used. 

   (2) Number of samples processed in parallel 
       increases with the number of processors. 


6.0 DDSP APPLICATIONS 


DDSP applications cover the fields of signal, 
sonar and image processing. DDSP implements basic 
functions such as digital filters, phase lock 
loops and FFTs. In addition, its programmability 
permits more specialized functions such as 
adaptive filters, synchronous video integrators, and 
signal search/recognition processors. 

DDSP can be used as an attached processor to a 
host computer or for dedicated applications in a 
totally self-contained configuration. In Figure 
6, one such application is shown for an adaptive 
array beam former (15) used in sonar signal 
processing. This example shows the versatility of 
DDSP to handle a large number of time-shared 
functions in real-time. In Figure 6, data is 
collected from 16 independent transducers. The 
data streams are formed into a beam by adapting to 
a reference signal generated by the pilot tone 
generator. The adaptation is performed by a 
least-mean-squares (LMS) algorithm that compares 
the incoming signal streams with delayed versions 
of the desired signal. The bandwidth of the 
desired signal fixes the bandwidth of the main 
beam and the front end delay determines the beam 
look direction. Two sets of FIR filters are used 
in the process: the first implements the 
adaptive LMS algorithm, resulting in continuously 
updated filter coefficients. The second applies 
the updated coefficients to the original data to 
form the desired beam. The resulting data stream 
is transformed into the frequency domain by an FFT 
and then averaged to enhance its spectral 
components. Display formatting and control are also 
performed in DDSP, resulting in a completely 
self-contained system. 

The hardware for this application consists of 
a DDSP system with eight processors, packaged in a 
single chassis. The processor utilization is 
about 50 percent, allowing for other time-shared 
applications. 


7.0 DDSP STATUS 


Our goal is to have a working DDSP prototype 
up and running in late 1983. The prototype will 
consist of four processors with complete software 
support. It will be interfaced to a VAX-11/780 
and a high speed digital recorder. The VAX will 
be used for compiling and loading software. Data 
input/output will be performed with either VAX or 
the high speed recorder. 

The hardware design consists of four unique 
designs. These include the matching store, 
processing element, bus controller, and I/O 
controller. The first two designs are repeated 
for each processor in a DDSP system. The bus 
controller is repeated for each set of four 
processors. At this time, the functional design 
is complete for all but the I/O controller. The 
detailed circuit design is complete for the 
matching store and is nearing completion for the bus 
controller. 

The software support consists of four major 
modules including the compiler, assembler, code 
compactor and diagnostic software. The detail 
software design is complete except for the 
diagnostic software. Code and test is complete for 
the compiler. 


8.0 CONCLUSIONS 


DDSP’s primary attraction is that it solves a 
total system problem that involves the programming 
of multiple processors to achieve very high 
speeds, thereby providing a flexibility unobtaina- 
ble with array processors. It has several unique 
characteristics that set it apart from other data 
flow computers including an efficient algorithm 
for routing data among processors, a special data 
structure used for transmitting data between 
procedures, and a generalized labeling scheme for 
multidimensional indexing. Simulation results 
show that processor efficiency is extremely high, 
even for large DDSP systems with figures well 
above 90 percent in most cases. 


SUMMARY OF A HYBRID DATA FLOW SYSTEM 


Gary N. Fostel 


Intermetrics Inc. 


Abstract 


Recent years have seen an explosive growth 
in research on high capacity systems 
incorporating large numbers of processors, 
producing solutions to varied aspects of the 
total problem. This paper explores the thesis 
that a selection from the available techniques 
together with a synergistic combination of 
device, architecture and programming technology 
can yield a very powerful, reliable and usable 
data flow system for a good price. 


Introduction 


Is there a pot of micro processors at the 
end of the rainbow? How big is the pot and what 
shape is it? This paper summarizes one answer to 
the second; a more detailed discussion can be 
found in [10]. The main ideas behind this 
development are: 


o The programming methodology must be 
substantially improved; Data Flow languages 
provide a point of departure. 


o The system’s architecture must be "two 
dimensional" to be consistent with current 
mass production technology (chips and 
boards). 


o Hardware must be matched to the software for 
efficient execution and human comprehension. 


Multi-level Programming 


One methodology currently under development 
by the Navy for signal processing algorithms 
(EMSP [7]) embraces three such levels: low level 
coding for computational fragments, a data flow 
(DF) language for organizing those fragments, and 
a HOL for control programs which monitor and 
control the DF graph (DFG). This split allows 
natural expression of the three different, but 
related facets of the problem. 


While multi-level expression might seem to 
invite incoherent and unanalyzable systems, 
program verification results suggest just the 
opposite. Guttag argues that a single formalism 
will not adequately span the verification 
requirements of any real world problem [9] and 
presents a three layer methodology of local, 
organizational, and system specifications. There 
is a compelling match between the pragmatics of 
programming in EMSP and the abstract issues of 
certification. 


The Web programming methodology similarly 
embraces three levels: machine code, DFG’s, and 
ACTORS. The nature of the machine code depends 
on the processing elements employed and might be 
systolic machine descriptions or hand-tuned ASM 
code for side-effect free primitive computations. 
DF languages are compatible with notations used 
by engineers and are nearly ideal in that context 
for organizing the primitives defined by the 
machine code. A variety of data flow languages 
exist and are well described elsewhere [1]. ACTOR 
systems have object oriented, message passing 
semantics: a generalization of the DF model with 
greater expressive power owing to greater 
flexibility. The combination of expressive power 
and data packages makes such a top level control 
an ideal extension of DFG’s. 


Figure 1. 


Figure 1 shows an application with intensive 
input reduction (G1) and output conditioning 
(G2). A probe is used for a monitor operation 
(P). Persistent data is maintained by ACTORs (A1 
and DB) and analyzed out of band by ACTOR A2. A2 
detects a need to create two instances of G2 and 
the output of G1 is replicated and fed (as input) 
to the G2’s. The OS provides device interfaces 
D1, D2, and D3, as well as graph loading, 
execution, replication of G1 output, and linkage 
with G2. The ACTORs serve as "intelligent glue" 
to hold DFG’s together as a larger system. The 
computation in Figure 1 might be a military 
system which detects a threat, launches two 
missiles, and then guides them to their targets. 


Web Architecture 


VLSI chips will provide cheap performance
only if they are part of a larger system which is
itself easy to build. For example, many to many
connection networks [2] solve some problems, but
may be costly to manufacture. Dennis [1] has
noted that the "complexity, as measured by total
wire length, grows as O(n**2)" for physical
layouts of components in many to many networks.
Indeed, the connection topology is the heart of
the Web design; the name "web" reflects the
similarity of spider webs and sketches of the
inter-connection topology. There are three
specialized processors (chips) to match the three
levels of the methodology.


P - Processor for low level computational
    fragments. (P-ipeline)

M - Management of the data flow graphs
    and interface to the ACTORs. (M-emory)

G - ACTOR control systems and Web
    communications and OS. (G-ateway)


There are two interconnect patterns in the
Web, one between P’s and M’s and another between
the M’s and G’s. The M’s thus serve as the
interface between the processor and control much
as a shared memory in a more conventional model.
While the M will indeed contain large areas of
on-chip memory, significant improvement over a
passive device is possible with addition of
on-chip intelligence and application tuning.


The M-P grid is principally devoted to the
data paths needed to support high bandwidth DFG
operation. The application DFG’s will be mapped
onto the network illustrated in Fig 2. The most
critical properties of this model are homogeneity
to simplify mapping, high connectivity to resist
single failures and allow high bandwidth, and lack
of any connections which cross to simplify
fabrication. The lack of universal
interconnectivity implies a trade-off in the
quality of the mapping of the DFG onto the M-P
grid.
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The MG _ grid is responsible for 
communications among ACTORs and with the external
environment. The requirements here are distinct
from those of the M-P grid: delay is more
critical, volume of data is lower. Larger grid
distances are likely in order to maintain global
control of the Web. The low level view of the
M-G grid is a number of local networks with a
small number of M’s, bounded by G’s which act as
gateways to other such local networks. One such
might be "G-M-M-M-M-G": the M’s are said to be an
M-string. The high level topology of these
strings is illustrated in Figure 3, with the
intersection points representing G’s and the
connecting lines the M-strings. The result is a
tree network with additional circular links added
to reinforce the fault tolerance and delay
characteristics of such trees, yet containing no
crossed lines. These networks are discussed in
[5].
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Figure 3. 


Implementation Problems 


First, the DFG to M-P mapping must be
optimized; further, the control system must be
able to keep up with the processing. The system in
the M-G grid will be hard pressed to control the
application. Second, the M, P and G chips must
be designed and built; the goals are aggressive:
million bit chip technology and miniaturized, HOL
architecture with store and forward network
interfaces in each chip. An optimization
strategy is proposed (Figure 4) which integrates
pre-compile, compile, load and run-time
techniques.
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Load HeuristicS  q——_————» 0S_ Measurement 
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Figure 4. 


Graph theoretic techniques can identify
clumping properties of DFG’s to reduce the volume
of scheduling decisions made at run-time. Most
appropriate to the model in Figure 4 is work by
Stryker [8] on deriving global properties of
DFG’s from local properties of the nodes in the
DFG, especially consumer-producer relationships.
Informal analysis can produce additional results
which are passed to the translator and run-time
system as semi-formal advice, in the manner of
pragmas in Ada.


Applications are loaded via mapping tricks, 
e.g. identity node creation, splitting, clumping
and replication to heuristically squeeze the 
tasks into the "best" spot. If the "best" is not 
good enough, or if a chip fails, overloads result 
which must be gradually spread over nearby 
processors. A simple rule to achieve this is: 
Processors constantly grab as much work as they 
can find, with whatever reserve capacity they 
have to look for it. This is similar to 
"diffusion scheduling" used by Ward and Halstead 
[4] for control of the Munet. 
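
The work-grabbing rule above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the Web's or the Munet's actual scheduler: the processors sit in a line, work is a divisible quantity, and a processor with reserve capacity takes up to half of the load difference from a busier neighbour. The names diffusion_round and diffuse, and all the numbers, are hypothetical.

```python
def diffusion_round(load, capacity, neighbours):
    """One sweep: each processor with reserve capacity pulls work from busier neighbours."""
    load = list(load)
    for p in range(len(load)):
        for q in neighbours[p]:
            spare = capacity[p] - load[p]
            if spare <= 0:
                break                          # no reserve capacity left to look for work
            surplus = (load[q] - load[p]) / 2  # half the difference, so the pair evens out
            if surplus > 0:
                moved = min(spare, surplus)
                load[p] += moved
                load[q] -= moved
    return load


def diffuse(load, capacity, neighbours, rounds=300):
    """Repeat the sweep a fixed number of times (assumed round count)."""
    for _ in range(rounds):
        load = diffusion_round(load, capacity, neighbours)
    return load


# Five processors in a line, each able to hold 4 units; the last one starts overloaded.
neighbours = [[1], [0, 2], [1, 3], [2, 4], [3]]
final = diffuse([0, 0, 0, 0, 16], [4] * 5, neighbours)
```

No processor needs global knowledge: the overload on the last processor drains away one neighbour at a time, which is the gradual spreading the text describes.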


The rectangular grid of the Web presents
problems. There must be four independent
connections on each M and P chip. Each of these
connections must be high bandwidth, leading to a
P with 128 pins! The M is worse. While not
unheard of, such pin counts are expensive in
silicon area devoted to pin out loads. In any
event, a dense rectangular packing of the M-P
grid allows no space for the G chips. A very
nice solution was proposed by Miller [6], using a
hexagonal grid, which leads to the layout shown
in Figure 5. All interconnections for the M-P
grid can be achieved in a single layer, with
extremely short connections. The longer radial
and circular connections of the M-G grid require
only one more connection layer.
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FUNCTION SHARING IN A STATIC DATA FLOW MACHINE(?) 


Kenneth W. Todd
Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139


Abstract — Sharing a single copy of the body of a function 
among its invocation points in a program has been an important 
means of keeping down the size of large programs and thus 
enabling them to run on conventional computers. To do the same 
for programs run on a static data flow machine is also desirable 
but not easy because of the nature of the machine. This paper 
presents a scheme at the machine level of a static data flow 
machine for sharing a function among its activation sites which 
can be further modified to accommodate a variety of constraints. 
For programs using this scheme, space consumption is reduced 
but at the cost of an increased execution time. 


The Static Data Flow Machine 


With the advance in VLSI technology, it has become easier
to make smaller and cheaper custom computer components. That, 
coupled with the current research on distributed systems and the 
desire to exploit parallelism in programs, has made data flow 
based computation attractive. 


Unlike traditional computers based on von Neumann
architecture, a data flow machine has no “program counter”.
Instead, an instruction executes when the values for all its
operands have arrived. After execution, its result is sent to other
instructions, possibly making some of these ready. Hence,
instruction execution sequencing is based on the data
dependencies among them and not on their location in the
program memory. A high degree of parallelism is also obtainable
since at any instant more than one instruction can be ready.


Data flow computers can be classified as either static or 
dynamic. Both classes base their execution on data flow 
principles, but the specifics vary. This paper is concerned with 
the static machine, which for the purposes of this paper differs 
from the dynamic machine in three ways. First, the static machine 
lacks a runtime loader — a program is loaded in its completed 
form. Second, an instruction can have at most one activation at
any instant. Third, instructions and their operand values are
stored together, making them not “pure”. A more in-depth 
discussion on a static data flow machine can be found in [5]. For 
details on a dynamic data flow machine see [1] and [2]. 


Figure 1 shows the configuration of the static data flow
machine currently under development by the Computation
Structures Group of the Laboratory for Computer Science at MIT.
It is constructed from two types of components: processing
elements (PEs), which hold both program and data, and 2x2
routers, which allow the PEs to communicate with each other. At


Figure 1. A Static Data-Flow Machine.


(a) This research was supported by the National Science 
Foundation under grant no. MCS-7915255 and the Department of 
Energy under contract no. DE-AC02-79ER10473.
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present this prototype consists of four PEs and a network of four 
routers. By adding more components the potential amount of
parallelism obtainable can be arbitrarily increased.


Instruction Cells 


The basic unit of execution in the static data flow machine 
is an instruction cell. A graphical representation of a simplified 
instruction cell is shown in Figure 2 as a box with several fields. 
The top field contains the opcode of the cell. Directly below it 
are the (initial) signals-needed value field and the signals-reset 
value field, whose functions are deferred until later. At the 
bottom of the cell are fields that hold the operand values. From 
the right of the cell extend result arcs and signal arcs. A result 
arc, represented by a solid line, is used by a cell to send copies 
of the result to operand fields of other cells. Signal arcs, 
represented by dashed lines, are used by cells to simply signal 
each other. This example shows a cell that computes
“B:=A+2”. The actual cell used by the prototype is more
complex and hence more powerful than the one presented here 
and is described in detail in [8]. 


A cell cannot execute (or fire) until it is ready. For this
simplified version of a cell two conditions must be met: (1) the
value of each operand must be present, either as a constant or a
value received via a result arc from another cell; and (2) the
signals-needed value must be zero. When a cell meets both
conditions, its number (or address) is placed on a queue of ready
cells maintained by its PE. Eventually the cell is fired, consuming
the values of all non-constant operands in the process, sending
copies of the result to operands of other cells as indicated by the
result arcs, signaling other cells as indicated by the signal arcs,
and overwriting the signals-needed field with the
signals-reset value.


It is often necessary to prohibit one cell from firing until
after another cell has fired. For example, if cell X sends its result
to an operand of cell Y, X should not fire again until Y has fired
and is thus ready for another value from X. To ensure this, the
signals-reset value of X is set to 1 and a signal arc is established
from Y back to X. When X fires, its signals-needed value is set to
1, and it cannot fire again until this value returns to zero. With
each signal reception, its signals-needed value is decremented.
When Y fires it signals X via the signal arc, causing the
signals-needed value of X to turn zero, meaning that X can then
safely refire. This example is so common that in order to keep
the graphs readable, a signal arc that is associated with a result
arc will be abbreviated by omitting the signal arc and replacing
the arrow head of the result arc with a solid one.
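
The firing rule and the acknowledge convention can be made concrete with a short simulation. This is a minimal sketch of the simplified cell only, not the prototype's actual cell format described in [8]. The class Cell, the function run, and the choice to scan for ready cells instead of keeping a per-PE ready queue are all assumptions made for illustration.

```python
import operator


class Cell:
    """Simplified instruction cell: opcode, signal counters, operands, arcs."""

    def __init__(self, name, op, arity, constants=None, signals_needed=0, signals_reset=0):
        self.name, self.op = name, op
        self.constants = dict(constants or {})        # operand index -> constant value
        self.operands = [self.constants.get(i) for i in range(arity)]
        self.signals_needed = signals_needed
        self.signals_reset = signals_reset
        self.result_arcs = []                         # (target cell, operand index)
        self.signal_arcs = []                         # target cells

    def ready(self):
        # (1) every operand present, (2) signals-needed value is zero
        return self.signals_needed == 0 and all(v is not None for v in self.operands)

    def fire(self):
        result = self.op(*self.operands)
        # consume non-constant operands; constants stay in place
        self.operands = [self.constants.get(i) for i in range(len(self.operands))]
        for target, index in self.result_arcs:
            target.operands[index] = result
        for target in self.signal_arcs:
            target.signals_needed -= 1
        self.signals_needed = self.signals_reset
        return result


def run(cells, max_firings=100):
    """Fire ready cells one at a time until none is ready (a stand-in for the PE queue)."""
    trace = []
    for _ in range(max_firings):
        ready = [c for c in cells if c.ready()]
        if not ready:
            break
        trace.append((ready[0].name, ready[0].fire()))
    return trace


# A multiply cell with constant 2 feeding an add cell, with an acknowledge arc back.
mul = Cell("IMUL", operator.mul, 2, constants={0: 2}, signals_reset=1)
add = Cell("IADD", operator.add, 2)
mul.result_arcs.append((add, 0))
add.signal_arcs.append(mul)
mul.operands[1] = 3      # value arriving for A
add.operands[1] = 4      # value arriving for B
trace = run([mul, add])
```

After the multiply cell fires, its signals-needed value is 1, so a second value arriving for A cannot overwrite the add cell's operand until the add cell has fired and acknowledged.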


As an example, Figure 3 shows three snapshots of the
instruction cell program graph calculating “C:=2*A+B”.
Snapshot #1 shows the state of the graph when values for A and
B have arrived. Snapshot #2 shows the graph after the IMUL cell
has fired. Snapshot #3 shows the graph state at the end of the
computation when the IADD cell has fired.


Figure 2. An Instruction Cell


Figure 3. Three Snapshots of an Instruction Cell Program Graph


Functions


The high level source language from which instruction cells
are generated for the prototype is VAL [3]. VAL is a functional, 
side effect free language designed primarily for numerical 
computations. Because instruction cells are functional in nature 
and since side effects would place restrictions on the sequence of 
instruction execution, VAL is ideal as a data flow source language. 


VAL, like most high level languages, provides for function 
definitions. A survey of the methods used in both dynamic data 
flow and conventional machines to implement functions reveals 
that none can be successfully applied to the static data flow
machine. The code of a function in conventional machines was
originally impure and worked as long as the function had no more
than one activation at a time. When recursion was added, the 
code needed to be pure and this was accomplished by moving the 
data and return address to a frame on a runtime stack. As for 
dynamic data flow machines, some link and load a fresh copy of 
the function body at the time of the call while others use colored 
tokens that permit multiple activations of cells. 


A straightforward procedure for implementing function calls 
in the static data flow machine is to insert the body of the 
function at each point of invocation, the same process that a 
time-optimizing compiler for any language would do with a small 
side effect free function. However, as the number of invocation 
points increases and as the size of the function grows, such a 
process can result in a rather large instruction cell program 
generated from a comparatively small VAL program. Since a single 
cell consumes more memory than a corresponding instruction in a 
conventional computer (32 bytes in the prototype), it would not 
take long for this process to fill the memory of a data flow
computer. In cases like this, a scheme for implementing the 
sharing of one copy of a function body among many or all of its 
invocation points can be advantageous since it would significantly 
reduce the number of cells generated. 


To accomplish this sharing, four problems must be dealt 
with. First, arbitration must be performed among different 
invocation points simultaneously calling the function. Second, the 
data associated with each invocation must be kept separate. 
Third, a way must be established to determine to which invocation 
point to send the results. Fourth, deadlock must be avoided. 


Figures 4 and 5 show how this sharing process can be 
implemented for a function that takes N arguments and returns M 
values. Figure 4 shows what is needed from the vantage point of 
a caller while Figure 5 shows what must be done on the part of 
the function. Starting from an invocation point in Figure 4, as 
each argument value of the function becomes available, cell 
number X is signaled and the value is stored in an ID (identity) cell 
where it waits for a signal to proceed into the function body. 
When all argument values are ready, the signals-needed value of
X is zero and it fires. This implies that the function call is strict in 
that an activation does not start execution of the body until all 
argument values are ready. To allow otherwise might result in 
mixing data of different activations. 
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Figure 4. Function Sharing from a Caller’s Vantage Point 
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Figure 5. Function Sharing from the Function’s Vantage Point 


When X does fire it sends the value “Y” to the SER
(serializer or non-determinate merge) cell in Figure 5. The SER
cell is a special case for the rules that determine when a cell is
ready to fire because it needs only one operand, and the result it
produces is the value of that operand. In case both operands are
present, the SER arbitrarily chooses which one to use this firing
and selects the other one next time. Using a binary tree of SER
functions, arbitration between any number of simultaneous
function calls can be achieved.
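
The arbitration described above can be sketched as follows. This is an illustrative model only: the class Ser, the function arbitrate, the fixed two-level tree for four callers, and the tie-breaking start state are assumptions, and the real SER cell may choose arbitrarily when both operands are present.

```python
class Ser:
    """Serializer cell: fires with one operand; alternates when both are present."""

    def __init__(self):
        self.slots = [None, None]
        self.last = 1                      # assumed start state: slot 0 wins the first tie

    def ready(self):
        return any(v is not None for v in self.slots)

    def fire(self):
        present = [i for i, v in enumerate(self.slots) if v is not None]
        pick = 1 - self.last if len(present) == 2 else present[0]
        self.last = pick
        value, self.slots[pick] = self.slots[pick], None
        return value


def arbitrate(caller_ids):
    """Merge four simultaneous call requests through a two-level SER tree."""
    left, right, root = Ser(), Ser(), Ser()
    left.slots = list(caller_ids[0:2])
    right.slots = list(caller_ids[2:4])
    order = []
    while left.ready() or right.ready() or root.ready():
        if root.ready():
            order.append(root.fire())      # value leaves the tree toward the SFAN cell
        for i, leaf in enumerate((left, right)):
            # a leaf may only forward when the root's operand slot is free
            if leaf.ready() and root.slots[i] is None:
                root.slots[i] = leaf.fire()
    return order


order = arbitrate(["Y0", "Y1", "Y2", "Y3"])
```

Every request emerges from the root exactly once, so the function body sees the calls one at a time regardless of how many arrive together.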


Eventually, the value “Y” reaches the root of the SER tree
and is sent to the SFAN (signal fan) cell. This cell has the effect of
creating a temporary signal arc between itself and the cell
number specified by its lone operand, as indicated by the
fan-shaped object in the figure. Thus, when this cell fires, a
signal is sent to cell number Y.


Referring again to Figure 4, once Y has received this signal 
it fires and signals the ID cells holding the argument values. These
cells in turn fire, sending their values into the function body.
When the cells that receive the arguments have fired, they send 
signals back to the SFAN cell of Figure 5 as shown. This prevents 
a different invocation point from sending its argument values to 
the function body before it is safe to do so. 


While the function is executing, it needs to remember where 
to send the results. Also, other invocations should be allowed to 
proceed with their function call when the body is ready. For this 
the first-in-first-out (FIFO) buffering of Figure 5 is introduced. 
This FIFO can be a chain of ID cells or some other construct that
behaves like a queue. Because the function body also exhibits
FIFO-like behavior, if invocation i starts before invocation j then
invocation i will terminate before invocation j. This will correctly
match each set of function results with the corresponding value
through the FIFO buffer.


After the results have been produced they must be sent to
the caller. To accomplish this, the FAN1 (fan to operand 1) cell is
used, a cell much like the SFAN except that the arc it creates is a
result arc from itself to the first operand of the cell specified by
the FAN1’s first operand, and the value sent over this arc is the
FAN1’s second operand. For each result there is a FAN1 cell, and
when the return value has been produced, it is sent to the second
operand of its FAN1 cell as shown in Figure 5. There is also an ID
cell at the activation point that is used as a receiver for this value
and an IADD cell at the FIFO’s end that is used to calculate the
number of that ID cell. Thus, for the i-th result, “i” is added to the
“Y” that eventually exits the FIFO and this sum is sent to the i-th
FAN1 cell along with the i-th return value. When the FAN1 cell fires
it sends this return value to cell Y + i.
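
The return path can be sketched in a few lines. This is a hedged illustration, not the prototype's cell graph: the function call_shared, the dictionary standing in for the receiver ID cells, and the example cell numbers 100 and 200 are hypothetical. It assumes calls have already been serialized and that the body behaves like a FIFO, as the text states.

```python
from collections import deque


def call_shared(body, calls):
    """calls: list of (Y, args) in the order granted by the arbitration tree."""
    fifo = deque()       # remembers, per activation, where the results must go
    pending = deque()    # arguments enter the shared body in the same order
    cells = {}           # cell number -> value received (stands in for receiver ID cells)
    for y, args in calls:
        fifo.append(y)
        pending.append(args)
    while pending:
        results = body(*pending.popleft())    # first activation in is first out
        y = fifo.popleft()
        for i, value in enumerate(results, start=1):
            cells[y + i] = value              # IADD forms Y + i, FAN1 delivers the value
    return cells


# Two callers whose cell Y is numbered 100 and 200; the body returns M = 2 results.
cells = call_shared(lambda a, b: (a + b, a * b), [(100, (2, 3)), (200, (4, 5))])
```

Because the FIFO and the body preserve the same order, each result set is paired with the Y of the activation that produced it.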


One drawback to this scheme is that a single invocation 
point is prohibited from having concurrent activations. Suppose 
for example that invocations A and B share the same function with
the results of A being fed to B, possibly indirectly. If A
generates results faster than B can consume them, this stream of
values will eventually extend back to the function body itself and
thus prevent the completion of any more calls. If B then attempts
a call, it too will not be able to complete. This would result in
deadlock since B is waiting for A to complete its current calls
while A is waiting for B to use the values it has sent. By
including the signal arcs in Figure 4 from cell numbers Y + 1 
through Y + M to X, a second activation of A will not start until 
the first one has completed and thus cleared the result receiver 
cells. This will prevent the backlog of values from stopping the 
output of the function body. 


Practical Considerations


Unfortunately, this last restriction degrades the efficiency 
of this function sharing scheme when used in a pipeline. If it can 
be determined at translation time that an invocation using a 
shared body is independent of all other invocations of that body, 
then this restriction can be lifted for that invocation.


If the function takes only one argument, a slight 
optimization can be performed by combining Y with the ID cell that
holds the lone argument value. An increase in the maximum rate 
at which the function body could handle activations would result. 


If many simultaneous calls to the function body are 
expected, then it will be desirable to make the body a maximal 
pipeline to achieve a high throughput rate. For simple
expressions and conditionals, [4] describes an algorithm that
achieves this by using buffering ID cells. For complex VAL
constructs such as loops, a description of more complicated 
techniques required can be found in [7]. 


There may be a limit to the signals-needed value; in the
prototype it is 15. Because of cell number X, the sum of the
number of arguments and return values is limited to 14, which is
not too confining. The number of result and signal arcs might also
be limited; in the prototype it is 6. This limits the number of
arguments to 6 because of cell number Y; again, not a serious
drawback. A function that exceeds either of these two limits should
reduce its number of arguments (results) by combining them into
a single record and passing (returning) the record pointer.


If addresses of cells are used instead of cell numbers, then
cell number Y + i would be found by adding i times the size of a
cell to the address of cell number Y instead of simply i, assuming
that all cells are the same size. If not, then it may not be
possible to calculate the return cell numbers at run time. In this
case the original scheme can be modified by augmenting the
function to take M additional arguments. The values of these
arguments would be the addresses of cells Y + 1 through Y + M
which are compile time computable constants and they would be
sent via FIFO buffering to their respective FAN1 cells.
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An example 


A test of this sharing scheme has been performed on the 
following example: 


tan(x) = sin(x) / cos(x) = cos(x - π/2) / cos(x)


Since the cos function is invoked from two different points, it is a 
candidate for function sharing. 


For the prototype, two translations (shared and unshared)
were derived for the tan function. A summary of these
translations and the results of computing tan(1.0) are given below:


                    Non-Sharing    Sharing        Sharing
                    Translation    Translation    Improvement
Cells Generated     115 cells       75 cells       35%
Cells Executed       91 cells      114 cells      -25%
Passes Performed     33 passes      45 passes     -36%


(A pass is the simultaneous execution of every cell that is
currently ready. It would be the order of the execution time of a
program if each cell fired as soon as it was ready.) As the table
shows, there is a time-space tradeoff involved. The size of the 
program is significantly reduced when sharing is used but at the 
cost of an increase in both the number of cells executed and 
passes performed. It should be noted that a large part of this 
increase is because the two cos calls are simultaneous. 
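
The improvement column can be checked directly from the two translation columns. The sketch below assumes the percentages are the relative change against the non-sharing figure, rounded to the nearest whole percent; the function name improvement is hypothetical.

```python
def improvement(unshared, shared):
    """Relative saving of the shared translation, as a whole-number percentage."""
    return round(100 * (unshared - shared) / unshared)


table = {
    "Cells Generated": improvement(115, 75),
    "Cells Executed": improvement(91, 114),
    "Passes Performed": improvement(33, 45),
}
```

Under that assumption the computed values agree with the table: a positive figure for space and negative figures for the two time measures.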


Conclusion 


Function sharing is not for every application. It can be a 
bottleneck in pipelines and in general increases the execution time 
of a program. However, it can significantly reduce the size of a 
program. In some cases the size reduction obtained would allow a
program that would otherwise be too large to be translated for
and run on a data flow machine.
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ABSTRACT 


FP, John Backus’ applicative language, naturally
expresses concurrency in programs. SERFRE evaluates
FP programs by exploiting their built-in concurrency
as much as possible. In the single user version it
has one I/O-processor (that either updates memories
of the C-processors when a data or a program
definition is given, or initiates a program evaluation
and returns the result to the user), and many
C-processors (that evaluate the programs). They
are organized in modules, a module being a small
number of C-processors and a strictly non-blocking
communication device having at least two more
ports than the number of C-processors in the
module. When evaluating a program a C-processor detects
concurrency, calls for non-busy C-processors, and
if any are available, initiates evaluation of
concurrent sub-programs on different C-processors
(or else behaves like a sequential processor).


INTRODUCTION 


An FP language (see J. Backus [Bac] for details)
consists of a set X of objects, a set F of basic
functions mapping objects into objects, a set A
of function names and a definition operation def
(def a = f means a is the name of the function f),
a set C of function constructors forming new
functions by combining objects, existing functions,
and names of defined ones (A denotes composition)
and an execution command : (f : x means f is
executed on the object x).

An FP program is such a function.

Some constructors can express concurrency in
programs, e.g. construction: [f1,...,fn] means
f1,...,fn are to be executed concurrently.


Programs in von-Neumann languages (FORTRAN, PASCAL 
...) can be translated into FP programs [Vil ], 
revealing their built-in concurrency.


STRUCTURE OF SERFRE 


SERFRE is a multi-processor command-driven (string
reduction) machine having only a few different
components (it is a VLSI architecture). It
directly executes an FP language, trying to have
sub-programs executed on different processors. It is
a dynamic loosely-coupled system using direct
communication with storage of messages.

Figure 1 describes the architecture of a possible
single-user implementation of SERFRE, and
figure 2 the structure of a module.

The I/O-processor either updates memories of the
C-processors when a data or a program definition
is given, or, initiates a program evaluation and
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returns the result to the user, or, takes care of 
local memory overflows by swapping to secondary
storage.

Each C-processor has its own memory (working as a
ROM for it) containing definitions of data and
of programs (defined functions).

C-processors exchange messages of the form: <receiver
address, program, data, sender address>. When a
C-processor has to evaluate a function formed by
a constructor involving concurrency, it calls the
(strictly non-blocking) communication device of its
module, which may call those of other modules,
for non-busy C-processors. If any are available, it
sends them the concurrent sub-programs and the data
to execute and waits for them to return their
results; otherwise it evaluates them sequentially.
A C-processor consists of a register for the
return address (sender), a stack for the program
(a place for each A-composed function), registers
(variable-length arrays) for the data, and a
reduction engine which first takes the top of the
program stack, then checks whether this is a basic
function (and evaluates it on the data and puts
the result in the data registers), or the name of
a defined function in the memory (and puts its
definition on the top of the program stack and
carries on), or a function formed using a
constructor involving concurrency (and calls for
C-processors for evaluating the concurrent sub-programs
and waits for their results). It contains a stack
for intermediate results in case of sequential
evaluation of recursively defined functions, and
of concurrent sub-programs.
When the program stack is empty, it calls the
C-processor corresponding to the return address
and sends it the message <return address, empty,
data, its address>.
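
The three cases handled by the reduction engine can be illustrated with a small interpreter. This is a sketch under loud assumptions and not the SERFRE design itself: a thread pool stands in for the non-busy C-processors, a program is a Python list whose last element is the top of the program stack, and the names BASIC, DEFS, PROGRAM, and evaluate are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Basic functions map one object to one object (assumed examples).
BASIC = {
    "add": lambda x: x[0] + x[1],
    "mul": lambda x: x[0] * x[1],
}

# Defined functions: name -> program. ("construct", [...]) marks a construction.
DEFS = {
    "sum_and_product": [("construct", [["add"], ["mul"]])],
}


def evaluate(program, data, pool=None):
    """Reduce a program on data; pool=None models 'no non-busy C-processor'."""
    stack = list(program)
    while stack:
        f = stack.pop()                               # take the top of the program stack
        if isinstance(f, tuple):                      # constructor involving concurrency
            subs = f[1]
            if pool is not None:                      # hand sub-programs to other workers
                futures = [pool.submit(evaluate, s, data) for s in subs]
                data = [fut.result() for fut in futures]
            else:                                     # behave like a sequential processor
                data = [evaluate(s, data) for s in subs]
        elif f in BASIC:                              # basic function: apply it to the data
            data = BASIC[f](data)
        else:                                         # defined name: push its definition
            stack.extend(DEFS[f])
    return data


# add composed with sum_and_product, applied to the pair [3, 4].
PROGRAM = ["add", "sum_and_product"]
with ThreadPoolExecutor(max_workers=2) as pool:
    concurrent_result = evaluate(PROGRAM, [3, 4], pool)
sequential_result = evaluate(PROGRAM, [3, 4])
```

Both paths give the same value, which mirrors the point of the design: concurrency is an opportunistic speed-up and never changes the result of the reduction.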
A full description of several proposed implementa- 
tions of SERFRE is given in the report submitted 
to the French Office of Patents. 


OTHER DESIGNS 


TRELEAVEN et al. review [T1,T2] the proposed
demand-driven architectures.

SERFRE compares to TRELEAVEN and MOLES design [T2]
but it does not have the global memory bottleneck
and has a more powerful means of communication
between processors.

MAGO'S design [MAG] is a tree organized system
and seems to waste a lot of time in communication
between processors, which we have tried to minimize.
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Abstract 


The Computation Structures Language (CSL)
defined and described herein is a vehicle for
specification and programming of multi-type,
multi-phase parallel computation architectures.
The design principles for CSL include: (i)
separation of parallel structuring from
sequential computation, (ii) use of higher level
language modules as the primitive execution
elements, (iii) provision for multiple modes of
data sharing and interprocess communication and
(iv) capabilities for the replication of
instantiations of program units and communication
channels. The formal computation model for CSL
is an extended form of colored Petri nets. The
motivation for development of this programming
system is the availability of reconfigurable
network architectured computer systems which can
implement multi-type/multi-phase parallel
computer architectures. Specification of
parallel architectures and programming of
parallel architectures are separately discussed.
The concepts are illustrated by examples.


Specification and Programming 
of Parallel Architectures 


Micro-electronic technology allows computer
architectures implementing high degrees of
programmable parallelism of several types and
even architectures capable of dynamically
reconfiguring to different types and degrees of
parallelism and communication geometry [KART77,
VIC79, SIE78, BRA79]. The Texas Reconfigurable
Array Computer (TRAC) [SEJ80, PRE80, KAP80,
JEN81] is a practical example of such an
architecture. The Computation Structures
Language (CSL) defined and described in this
paper is a vehicle for exploiting such
architectures. CSL implements both specification
of parallel computation structures and
programming of these parallel architectures. CSL
supports dynamic structuring of computations
through multiple phases each of which may display
different types and degrees of concurrency and
differing requirements for sharing of data and
interprocess communication. The model of
computation implemented by CSL is that of an
extended form [KAP82] of colored Petri nets
[PET80]. A modeling system for analysis of the
execution behavior of CSL programs has been
developed utilizing this correspondence to
colored Petri nets.


CSL is being developed in the context of the
Texas Reconfigurable Array Computer Project
(TRAC). Its implementation will utilize the
unique capabilities of the TRAC architecture for
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representation of a wide 
spectrum of parallel computation
architectures and for dynamically
reconfiguring itself between these parallel
architectures. TRAC can implement any MIMD
configuration of SISD or SIMD tasks within the
span of its resource set. (See [SEJ80] for a
brief description of TRAC.) The concept base of
CSL is, however, in large measure independent of
any particular architecture. CSL could be
compiled for a conventional serial architecture
or for a vector processor. The prototype CSL
interpreter is indeed being developed in Pascal
with simulation of parallel process executions on
a DEC PDP-10.


A Rationale for Multi-type/Multi-phase 
Parallel Computation 


It is well known that applicability of a
computer architecture which implements any one
fixed type of parallelism has been limited by the
difficulty of mapping an extensive set of
problems to execute efficiently on any single
type of parallel structure. Vector streaming
parallelism has been effective on many large
scale numerical problems [VOI77,JOR77]. SIMD
parallelism of fixed degree and fixed
interconnection structure has been found to be
effective on a limited class of problems
[SAM78,KUC77]. The limited effectiveness of
these architectures has often been hard-won.
There are three reasons for this historically
experienced difficulty.

1. Mapping complex computations to a single
   architecture may lead to significant
   portions of the computations being based
   on high operation count algorithms.

2. It is often the case that a computation
   will pass through several phases each of
   which needs a different parallel
   structure or different degrees of
   parallelism for efficient realization.

3. Mapping of the communication
   requirements of a computation upon a
   single fixed interconnection geometry
   may lead to heavy data movement costs
   [GEN78].


Recent investigations of the interconnection geometries
required for efficient execution of such significant tasks as
the solution of Poisson's Equation [GRO79] and of finite
element equations [GAN81] have shown that no single type of
interconnection network is suitable for these problems. Kapur
and Browne [KAP81] have decomposed the solution of block
tri-diagonal linear systems into its natural computation
structures and find that three different basic modes of
interconnection are required in the absence of a paracomputer
architecture. (A paracomputer is a multiprocessor which
implements conflict-free access to common memory [SCH80].)


These factors lead naturally to investigation of programming
principles for multi-type/multi-phase parallel computation
structures.




for Multi-type/Multi-Phase Parallel Programming 


MULTI-PHASE PARALLEL PROGRAMMING 


Parallel programming adds to sequential programming the
requirement for definition and programming of protocols to
govern the interactions of the concurrently executing
processes. A language system for parallel programming must
therefore include the following capabilities above those for
sequential programming:

*definition and control of concurrently executing processes

*definition of mechanisms for interprocess communication

*definition of mechanisms for correct and efficient sharing
 of data


Multi-type/multi-phase parallel programming adds the further
requirement that the process and communication structures be
specifiable at run time and also be reconfigurable as the
computation progresses through its phases. It is also often
necessary to pass results obtained in one phase to a later
phase. We have also found in our attempts to write parallel
programs that there is a need for convenient and flexible
means of creating multiple instantiations of given program
units and of the communication channels between program units.


CSL implements these requirements in a 
formulation based upon four design principles. 


1. separation of specification and
   programming of concurrency and
   interprocess communication from the
   programming of sequential computation
   units.

2. use of the separably compilable units of
   a higher level language as the execution
   units from which the parallel
   computation structures will be composed.

3. inclusion of both shared address space
   and address space data transfer modes of
   communication.

4. recognition that in parallel programming
   processes and communication channels
   must have structure and multiple
   occurrences just as does data in
   sequential programs.


These four principles are the foundation for a
straightforward but flexible language system for efficient
parallel programming and effective utilization of dynamic
reconfiguration.

CSL deals only with definition and control of computation
structures and with specification and implementation of
interprocess communication. Execution units (tasks or
processes) are sequential programs which can be written in any
language (Pascal for the current TRAC implementation). CSL
provides capabilities for composing execution units into
computation structures, for defining the mechanisms and
protocols for communication between tasks and for initiating
and controlling task executions. CSL provides only that
computational power necessary to implement task control. This
approach is to be contrasted with that of more conventional
languages for parallel programming such as PL/I, ALGOL68, and
Concurrent Pascal, which have added specific concurrent
control features to general purpose programming languages. CSL
separates the programming of elementary execution units from
the composition and control of computation structures. This
separation of conceptually disjoint problem domains leads to a
clear specification and programming interface.


Use of separately compilable units of higher level languages
as the unit of composition for parallel computations allows
flexibility in unit size. The execution units do not know
whether they are executing independently or as a part of a
concurrent process set. The CSL programmer does require
knowledge of which data structures in the tasks will be shared
or involved in communication. Indexing of programs is
implemented as a means of replication of many program units
executing on different data sets. This of course also requires
the indexing of communication channels coupling the program
occurrences.

CSL is a block structured language which uses
macro-definitions as a code compression device. This choice
was dictated by the convenience offered for the interpretive
implementation planned. Macro declarations may be recursive.


CSL supports both the sharing of variables (overlapping
address spaces) between sets of processes and data transfer
between disjoint address spaces. The TRAC architecture
supports efficient implementation of both mechanisms [SEJ80].
Overlapping address spaces are supported by creating network
connections linking a given physical memory module to more
than one processor. The sharing of access is on a logical
basis rather than a physical access basis. A processor
attaches a shared memory through a privileged instruction. The
attach will not be honored by the network hardware until the
requested memory module has been released by its current
holder (if it is currently attached). A segment of memory is
thus switched from address space to address space rather than
being shared across otherwise disjoint address spaces. The
logical concept of data sharing in CSL is, however, specified
independently of the implementation model.


CSL can be thought of as representing the logical endpoint of
an operating system job control language. A CSL program is a
job control program for an environment of great flexibility.
It represents a prototype for the job control language for
many-component and reconfigurable architectures.

The subsequent sections sketch and illustrate some of the
capabilities of CSL. The Users Reference Manual [ADI81] gives
a full definition of each statement in the language.

Specification of Parallel Architecture 

The specifications for a computational architecture are
written in terms of logical program elements such as tasks,
shared data and logical channels between tasks rather than in
terms of device or processor properties. The specifications
for a structure of a given type and phase are bound by a
CONSTRUCT statement. The architecture bound by a CONSTRUCT
statement remains in effect until the execution path
encounters another CONSTRUCT statement. CONSTRUCT statements
can be nested. TASKS, CONDITIONS and CHANNELS, the elements of
a computation structure, are defined within a CONSTRUCT
statement. The effect of a CONSTRUCT is for the system
resource scheduler to configure an architecture conforming to
the specification and to map the TASKS and CHANNELS upon this
architecture. A User's Reference Manual which gives examples
of each statement as well as a full syntactic and semantic
definition is available. The most effective exposition of CSL
is, however, by example. Figure 1 is a CONSTRUCT statement for
TASKS and shared variables taken from an example program given
in full detail in Appendix A.


CONSTRUCT
  TASKS
    t2(i)   : C[s(i),s(i+1),s(i+2)];
    t2(j-1) : C[s(j-2),s(j-1),s(j)];
    t2(m)   : C[s(m-1),s(m),s(m+1),s(m+2)]
              RANGE m = i+2 TO j-3 BY 2;

Figure 1: Example of a CSL Construct Statement


The first thing to notice is that the TASK declaration, t2(m),
is based on indexing. It is often the case that many
invocations of identical processing are required. Indexing
gives a convenient method for specification of the number of
identical process replicas and also for associating data with
each invocation. C declares the file upon which the program
code is to be found. The [s(i), s(i+1), s(i+2)] associates
with t2(i) the shared data elements s(i), s(i+1) and s(i+2).
The actual structure of s is defined within the task code. It
might be a column of an array or an entire array. The
declaration of s(i), s(i+1) and s(i+2) as associated with
t2(i) notifies the system scheduler to establish a memory
configuration where t2(i) can access s(i), s(i+1) and s(i+2).
The RANGE declaration specifies the number of tasks and shared
data elements to be instantiated in this configuration.
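
The effect of the indexed declaration in Figure 1 can be sketched outside CSL. The following Python fragment is only an illustration, not part of the CSL system: the names expand_range and construct_t2 are hypothetical, and RANGE is assumed to be inclusive at both ends, as a Pascal TO loop is. It lists, for each t2 replica, the indices of the shared elements s(...) the scheduler would have to make accessible.

```python
def expand_range(start, stop, step=1):
    """Inclusive index list, mirroring 'RANGE m = start TO stop BY step' (assumed inclusive)."""
    return list(range(start, stop + 1, step))


def construct_t2(i, j):
    """Map each t2 task index to the indices of the shared elements it is associated with."""
    tasks = {
        i: [i, i + 1, i + 2],      # t2(i)   : C[s(i),s(i+1),s(i+2)]
        j - 1: [j - 2, j - 1, j],  # t2(j-1) : C[s(j-2),s(j-1),s(j)]
    }
    # t2(m) : C[s(m-1),s(m),s(m+1),s(m+2)] RANGE m = i+2 TO j-3 BY 2
    for m in expand_range(i + 2, j - 3, 2):
        tasks[m] = [m - 1, m, m + 1, m + 2]
    return tasks


tasks = construct_t2(1, 8)
for index in sorted(tasks):
    print(f"t2({index}) shares s{tasks[index]}")
```

With i = 1 and j = 8 the declaration yields four replicas, t2(1), t2(3), t2(5) and t2(7), and neighboring replicas overlap in the shared elements they name.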


A CONDITION may be associated with each 
TASK. It becomes a variable shared by the TASK 
and the CSL program. CONDITION's are the only 
overlap between the address space of tasks and 
the controlling CSL program. 


CHANNEL's implement 1 to N communications between tasks.
DATACHANNEL's are declared for the movement of high volume
data between task address spaces. MESSAGE CHANNEL's are
declared for the movement of control information and
low-volume data. Different implementations are used for the
two channel constructs. There are a number of extended
declaration capabilities such as specification of the number
of buffers associated with a CHANNEL. Figure 2 is an example
of a DATACHANNEL declaration taken from the program given in
Appendix A.


CHANNELS
  (moveright[i] = DATACHANNEL FROM f-c-p(i) TO
                  f-c-p((i+1) MOD (N+1));
   moveleft[i]  = DATACHANNEL FROM f-c-p(i) TO
                  f-c-p((i+N) MOD (N+1));)
  RANGE i = 0 TO N;


Figure 2: Channel Declarations 
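
The modular index expressions in Figure 2 connect the N+1 copies of f-c-p in a ring. As a small check of that arithmetic, the Python sketch below (the function name ring_channels is hypothetical and not part of CSL) computes the destination task of each channel; note that (i+N) MOD (N+1) is simply the left neighbor i-1, taken modulo N+1.

```python
def ring_channels(n):
    """Destination task of moveright[i] and moveleft[i] for tasks 0..n (n+1 tasks in a ring)."""
    size = n + 1
    moveright = {i: (i + 1) % size for i in range(size)}
    moveleft = {i: (i + n) % size for i in range(size)}
    return moveright, moveleft


right, left = ring_channels(3)
print(right)  # task i sends rightward to right[i]
print(left)   # task i sends leftward to left[i]
```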


CONSTRUCT statements must appear prior to 
the execution of the tasks or use of the channels 
they define. The appearance of a CONSTRUCT 
statement in the execution path of a CSL program 
voids previous CONSTRUCT statements with release 
of all resources unless the CONSTRUCT statement 
encountered is contained in a nested parallel 
structure (nested COBEGIN, see Section 5). The 
RETAIN statement provides an exception to release 
of resources. The memories containing the data 
elements specified in a RETAIN statement will be 
(logically) retained in the system for further 
processing within a subsequently encountered 
CONSTRUCT statement. 


Parallel Programming with CSL 


The three additional tasks of parallel 
programming (over sequential programming) are: 


1. establishment and control of concurrent 
execution of tasks 


2. implementation of interprocess 


communication 


3. control of access to shared data 


CSL attempts to use as sparse a syntax set as is consistent
with ease of programming.

We describe here only those language constructs relevant to
parallel programming. Assignment, repetition and sequential
control flow statements are minimal in number and Pascal-like
in structure, and will not be discussed herein.


Execution control is implemented by eight statements:
EXECUTE, COBEGIN-COEND, TERMINATE, STOP, CONTINUE, WAIT,
SIGNAL and RESET. Let us again refer to an example (Figure 3).


WITH s(m), s(m+1) DO EXECUTE t2(m) 
RANGE m = n TO j-1 BY 2; 


Figure 3: Example of a WITH Statement 


EXECUTE is followed by a list of task names. These tasks
execute (logically) in parallel. Communication between the
tasks in the list can only be through shared variables. This
is the characteristic statement for expressing SIMD
executions. The statements contained between a COBEGIN-COEND
pair are logically executed in parallel. They will often,
however, contain synchronization statements such as
WAIT/SIGNAL/RESET or CHANNEL commands such as SEND/RECEIVE.
This is the mode for expressing MIMD or pipelined processing.


Access to shared data is governed by the WITH statement. The
task(s) named in the EXECUTE begin execution whenever they can
have exclusive access to the shared variables contained in the
WITH statement. The WITH construct in Figure 3 is used for
exclusive access to the variables s(m) and s(m+1) for task
t2(m) for even m. In its extended formats the WITH construct
also implements read-only access, and exclusive access to some
data items combined with read-only access to others.


The WHEN statement implements the Dijkstra [DIJ75] guarded
command construct. It is normally used on conditions defined
in the CSL or on the standard system generated conditions on
CHANNELS. WAIT/SIGNAL/RESET complete the synchronization
constructs. They are functions defined upon CONDITION
variables and have the obvious semantics.


Interprocess communication between 
independently executing tasks is via CHANNELS. 
SEND and RECEIVE are illustrated in Figure 4. 


(SEND leftarr TO moveleft[i];
 SEND rtarr TO moveright[i];
 RECEIVE newright FROM moveleft[(i+1) MOD (N+1)];
 RECEIVE newleft FROM moveright[(i+N) MOD (N+1)];)
RANGE i = 0 TO N-1;

Figure 4: Example of Data Movement through CHANNELS


SEND places a data item on a channel. It is then available
for the tasks declared as eligible to RECEIVE. A SEND blocks
only if no buffer space is available, while a RECEIVE blocks
unless there is a data item on the channel specified in the
command. A RECEIVE on a data channel "removes" the data from
the channel and decrements the count of expected receives. The
message is removed only when all expected RECEIVE's have been
executed. The action of a SEND/RECEIVE is to transfer values
between data structures defined in task address spaces. These
data structures must be declared in the outermost block of the
Pascal programs defining the tasks. Data appearing in shared
data declarations cannot be sent via channels.
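
The blocking rules just described can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions and not the TRAC implementation: the class names Channel and WouldBlock are hypothetical, a blocked operation is represented by raising an exception rather than by suspending a task, and the "count of expected receives" is modeled as the set of receiving tasks that have not yet read an item.

```python
class WouldBlock(Exception):
    """Raised where a real task would block (modeling assumption)."""


class Channel:
    def __init__(self, n_buffers, receivers):
        self.n_buffers = n_buffers             # buffer slots declared for the channel
        self.receivers = frozenset(receivers)  # tasks eligible to RECEIVE
        self.slots = []                        # entries: [item, receivers still expected]

    def can_send(self):
        return len(self.slots) < self.n_buffers

    def send(self, item):
        # A SEND blocks only if no buffer space is available.
        if not self.can_send():
            raise WouldBlock("no buffer space")
        self.slots.append([item, set(self.receivers)])

    def receive(self, task):
        # A RECEIVE blocks unless a data item is waiting for this task.
        for index, (item, pending) in enumerate(self.slots):
            if task in pending:
                pending.discard(task)
                if not pending:
                    # Removed only once every expected RECEIVE has run.
                    del self.slots[index]
                return item
        raise WouldBlock("no data item on channel")


ch = Channel(n_buffers=1, receivers={"t1", "t2"})
ch.send("column-0")
first = ch.receive("t1")            # item stays buffered: t2 has not read it yet
still_full = not ch.can_send()
second = ch.receive("t2")           # last expected receive frees the buffer
```

Tracking receivers as a set rather than a bare counter is a design choice of this sketch; it prevents one task from consuming the same item twice, which the text does not address.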
The CSL Computation Model 


The existence of a formal computation model for a programming
language offers a number of significant advantages. It allows
an assessment of the power and applicability of the language.
It guides the development and verification of correct
programs. It provides the foundation for performance analysis
of programs written in the language.


The formal computation model for CSL is an 
extended form of colored Petri nets [PET80]. The 
principal logical extension is to partition 
places into HOLD and ENABLE regions. This is 
illustrated in Figure 5. 
Figure 5: Extended Colored Petri Net Segment


The HOLD region simply holds tokens until the 
state of the transition and input places enables 
some one token in each place to participate in a 
firing. The algebraic representation of such a 
system is given by defining Set Operation Systems 
[KAP82]. Set operation systems generalize Vector 
Replacement Systems [KEL78] by allowing sets of 
"tokens" to be present at a place. The concept 
correspondence between CSL and colored Petri nets 
is given in Table 1. The actual modeling system 
also incorporates interval clocks in order to 
support performance evaluation of the execution 


of CSL programs. 


CSL Constructs                    Colored Petri Net Constructs
--------------------------------  ----------------------------
shared data elements, data        colored tokens
elements sent through CHANNELS

TASKS                             transitions

WITH statements, CHANNEL          places
buffers, CONDITIONS

Table 1 - Correspondence between CSL constructs and colored
Petri nets


Implementation Structure 


A full implementation feasibility demonstration for the CSL
based programming system for parallel programming was executed
before the design given here was adopted. Figures 6 and 7
schematically show the structure of the system and the
relationship of the several components. The responsibility of
the several components is as follows.


[Figure 6: System structure. Labeled elements: system
scheduler; JM (job monitor); T1, T2, ..., Tn (n tasks). The
arrows distinguish control message flow from data message
flow.]


Each job, which consists of a reconfigurable set of tasks, is
driven by a job monitor. The job monitor consists of four
components. The CSL program specifies the configurations for
the computational structure and the parallel process
execution. The CSL run-time system is an interpreter for the
declarations and executable statements of the CSL program. The
configuration analyzer loads the CSL program and scans it for
CONSTRUCT statements. The configuration analyzer then
negotiates with the system scheduler for the establishment of
resources and parallel architectural structures to conform to
the computational architecture specified in the CONSTRUCT
statements. Whenever a configuration has been established,
control is turned over to the appropriate executable portion
of the CSL program. This program is then interpreted by the
CSL run-time system. Control of the tasks of a CSL program is
attained through the sending of messages between the CSL
program running in the job monitor and the tasks executing in
the several task processors. Task executions are controlled
through the sending of messages via packets in the TRAC
network. SEND's and RECEIVE's on data channels are executed
through the switchable memory concept of TRAC. SEND's and
RECEIVE's on message channels are implemented by packet
movement.

[Figure 7: Job monitor structure. Labeled elements: JCPM (job
control policy module); JCPM interpreter; logical message
handler; resource request generator; message arrival; message
broadcast; processor resident monitor.]


It is necessary for the processor resident monitor of each
task to understand the data structures which it is to send and
receive. Accordingly, the configuration analyzer initializes
the processor resident monitor with the locations and
characteristics of the data structures which it is to send and
to receive. Transfers of data via the shared switchable memory
mechanism are also initiated via packets sent to the
appropriate processor resident monitors. The details of this
communication are given in a report, "Processor Resident
Monitor of TRAC" [CAN81].

The implementation of the CSL interpreter has been completed
on the DEC-10 in Pascal. It is serving as a simulator in which
the task code is replaced by dummy stubs. A design for the job
configuration analyzer has been made down to the level of
Pascal data structures and flow of control through functions
and procedures. The processor resident monitor has been
resolved down to the streams of flow of control through the
modules and functions for all major tasks, including loading
of task modules, initialization procedures, acquisition and
release of shared memories, handling of page faults and
handling of packet arrivals.


Summary

This paper has described a programming 
system designed to exploit the capabilities of 
reconfigurable multi-processor architectures. 


The feasibility of this programming system has 
been established and a number of non-trivial 
programs coded in this system to demonstrate its 
applicability. 
Appendix A - An Example


This CSL program is a complete parallel formulation for a
particle-in-cell code. The details of the problem formulation
can be found in a TRAC project technical report [BRO81]. The
major portion of the code (MACRO's oerr and oerb) is a
parallel structuring of Odd-Even Reduction [KAP82]. The MACRO
poisson combines oerr and oerb to solve Poisson's equation.
The main program alternates calls to the Poisson equation
solver and the routine f-c-p, which computes fields and moves
charges. Each copy of f-c-p computes the field and moves the
charges in a column of "cells". The copies of f-c-p must
communicate with their nearest neighbors in order to compute
fields and hand particles to the columns where they move in a
given time step. The major execution steps of the program are
contained in the last 10 lines of the program.


Program Pictst;

VAR
i,j,N,col,p : INTEGER;

MACRO oerr(s,i,j,k);
{definition for reduction step of Odd even}

VAR m : INTEGER;

BEGIN
IF k>1 THEN
{reduction step number}
BEGIN
{phase 1: diagonal block solution}
CONSTRUCT
TASKS t1(m) : B[s(m)] RANGE m = i TO j;
END;
EXECUTE t1(m) RANGE m = i TO j;
END;
BEGIN
{phase 2: merging neighboring rows }
{using 2-pole/3-position switch }
CONSTRUCT
TASKS
t2(i) : C[s(i),s(i+1),s(i+2)];
t2(j-1) : C[s(j-2),s(j-1),s(j)];
t2(m) : C[s(m-1),s(m),s(m+1),s(m+2)]
        RANGE m = i+2 TO j-3 BY 2;
END;
COBEGIN
//WITH s(i) DO EXECUTE t2(i);
//WITH s(m-1),s(m) DO EXECUTE t2(m)
        RANGE m = i+2 TO j-1 BY 2;
COEND;

FOR n = i TO i+1 DO
WITH s(m),s(m+1) DO EXECUTE t2(m)
        RANGE m = n TO j-1 BY 2;

COBEGIN
//WITH s(m+1),s(m+2) DO EXECUTE t2(m)
        RANGE m = i TO j-3 BY 2;
//WITH s(j) DO EXECUTE t2(j-1);
COEND;

BEGIN
{inverse perfect shuffle}
CONSTRUCT
TASKS
t3(2m-1) : D[s(2m-1),s(m)]
        RANGE m = i TO j/2;
t3(2m) : D[s(2m),s(m+(j-i)/2)]
        RANGE m = i+1 TO j/2 -1;
END; {construct}

WITH s(m) DO EXECUTE t3(m) RANGE m = i TO j-1;

COBEGIN
//WITH s(m) DO EXECUTE t3(2m-1)
        RANGE m = i TO j/2;
//WITH s(m+(j-i)/2) DO EXECUTE t3(2m)
        RANGE m = i TO (j-1)/2;
COEND;
END;

k := k - 1;
RELEASE;
oerr (s,(j+i)/2,j,k);
{invoke next pass of reduction}
CONSTRUCT
TASKS solve : E;
END; {construct}
EXECUTE solve;
{solution of single block returned by reduction}

ENDMACRO; {oerr}
MACRO oerb(s,j); {back substitution for oer}

{j:block dimensionality of original matrix....}
{.........power of 2 minus 1 }
{k:step number, initially, k=1 }

VAR m : INTEGER;

BEGIN
FOR k = 1 TO (log(j+1)-1) DO
BEGIN
CONSTRUCT
TASKS
b(m) : F[s(m-(j+1)/2**(k+1)), s(m),
         s(m+(j+1)/2**(k+1))]
       RANGE m = (j+1)/2**(k+1)
       TO j+1-(j+1)/2**(k+1)
       BY (j+1)/2**(k+1);
END; {of construct}

( WITH s(m),s(m+(j+1)/2**(k+1)) DO EXECUTE b(m)
  WITH s(m-(j+1)/2**(k+1)),s(m) DO EXECUTE b(m))
  RANGE m = (j+1)/2**(k+1) TO j+1-(j+1)/2**(k+1)
  BY (j+1)/2**(k+1);
END; {for}
ENDMACRO; {oerb}


MACRO poisson (s,j);

BEGIN
oerr(s,1,j,log(j));
oerb(s,j);

ENDMACRO; {poisson}


BEGIN {main program}
{ initialize
p : the number of processes
N : total number of columns in the grid
col : no. of columns assigned to each process
M : number of iterations required
NOTE: 1) p = N/col
      2) each process or task also
         needs the two columns
         adjacent to those assigned to it.}
CONSTRUCT
TASKS
f-c-p(0) (init,move,charge) : Cfile[qv(N),qv(j)]
        RANGE j = 1 TO col+1;
f-c-p(i) (init,move,charge) : Cfile[qv(i*col+j)]
        RANGE ((j=0 TO col+1),(i=1 TO N-2));
f-c-p(p) (init,move,charge)
        : Cfile[qv(1),qv((N-1)*col+j)]
        RANGE j = 0 TO col;
END; {of construct}
(EXECUTE f-c-p(i).init; {initialization}
EXECUTE f-c-p(i).charge;
{maps charges from particle positions
 to mesh points }
FOR j = 1 TO M DO
BEGIN
poisson (qv,i); {solves poisson equation}
COBEGIN
CONSTRUCT
TASKS
f-c-p(0) (init,move,charge) : Cfile[qv(N),qv(j)]
        RANGE j = 1 TO col+1;
f-c-p(i) (init,move,charge) : Cfile[qv(i*col+j)]
        RANGE ((j=0 TO col+1),(i=1 TO N-2));
f-c-p(p) (init,move,charge)
        : Cfile[qv(1),qv((N-1)*col+j)]
        RANGE j = 0 TO col;
CHANNELS
(moveright[i] = DATACHANNEL FROM f-c-p(i) TO
        f-c-p((i+1) MOD (N+1));
moveleft[i] = DATACHANNEL FROM f-c-p(i) TO
        f-c-p((i+N) MOD (N+1));)
RANGE i = 0 TO N;
END; {of construct}

//( EXECUTE f-c-p(i).move;
{each task moves its particles}
SEND leftarr TO moveleft[i];
SEND rtarr TO moveright[i];
RECEIVE newright FROM
        moveleft[(i+1) MOD (N+1)];
RECEIVE newleft FROM
        moveright[(i+N) MOD (N+1)];
{information about particles that
 crossed partitions is sent to
 adjacent tasks }
EXECUTE f-c-p(i).charge;)
{calculates new charge distribution}
RANGE i = 0 TO N-1;
COEND;
END;
END.
ALGEBRA OF EVENTS:
A MODEL FOR PARALLEL AND REAL TIME SYSTEMS


P. CASPI, N. HALBWACHS 
IMAG Laboratory 
Grenoble, FRANCE 


Abstract: The model presented here differs from the usual
models of parallel processing in two respects. On the one
hand, it takes fully into account the metric notion of time,
thus allowing the description of hard real time systems. On
the other hand, it is a pure behavioural model, in the sense
that it does not use any abstract machine notion. From a
formalization of the notion of event, we show that the
behaviour of a logical system may be described, by means of a
few operators, in a precise and concise way. The algebraic
properties of the model are then studied, in order to define
some methods for analysing or transforming systems described
in this formalism.


INTRODUCTION 


Two different notions of time are used in system modeling. In
sequential systems, as long as time performance is not
considered, the time concept may be reduced to the ordering of
actions, or more generally of events occurring during the
system life, which is a perfectly known total ordering
relationship. In parallel systems, the ordering of events
depends on the execution time of the actions, so a precise
description of such a system needs the usual metric notion of
time. However, since the execution times are generally
unknown, the correctness of parallel systems is commonly
required to hold independently of any assumption about the
speeds of the processors involved. Many authors were thus led
to consider the ordering of events in a parallel system as a
partial ordering, and to treat parallel systems as
nondeterministic sequential ones. This approach makes it
possible to dispense with any metric notion of time, and has
led to most of the proof techniques for parallel programs.
However, it does not apply as soon as real time systems are
considered. In such systems, the metric notion of time is used
not only to compare the performances of several
implementations, but also to decide whether a system meets its
specifications.


Another characteristic of many approaches to parallel
behaviour modeling (for instance [1],[7]) is the use of an
abstract machine model, more or less derived from finite state
automata. A behaviour is defined as an equivalence class upon
the set of machines, and thus the proof of a system reduces to
the proof of the equivalence between the abstract machines
representing the specification and the implementation of the
system. The drawbacks of such an operational approach for the
initial specification process have been pointed out in [3]. In
short, the specification language is generally far from being
natural, and may lead to overspecification.


In this paper, we present a purely behavioural model for
logical, parallel or real time systems, which takes fully into
account the real time dependencies between internal and
external events of a system. Our notion of time may be viewed
as a simple ordering time, as far as purely parallel systems
are considered, or as a metric time, assumed to be the global
time of an observer external to the system.


In section 1, the basic notions of time and event are
defined. An event is represented by an increasing staircase
function from time to the non-negative integers, which counts
the number of occurrences of the event over time. An ordering
relationship and a set of operators are provided in section 2,
which structure the set of events as an ordered semiring. To
illustrate the descriptive power of this algebra, we show
(section 3) that finite state machines and Petri net models
may be specified by systems of linear equations and
inequalities over events. In order to define an effective
calculus on such specifications, the algebra is extended in
section 4 to become a ring, the elements of which are called
pseudoevents. The use of this calculus in real time systems
design problems is illustrated in section 5. Section 6
describes a systematic method to get approximate results about
descriptions in our model, by means of discrete transforms of
pseudoevents. Some nice properties of the algebra, when time
may be considered as discrete, are given in section 7. In
conclusion, the extension of the model towards numerical
systems is discussed, and open problems are set, the solution
of which would greatly increase the capabilities of our
calculus. Most proofs have been omitted, but may be found in
an extended version of this paper [5].


1. TIME AND EVENTS 
1.1 Time 


Our notion of time refers to an absolute one, such as
perceived by an observer external to the system. At the
description level, the problem of the relative times measured
by the clocks of several subsystems in a distributed system,
such as studied in [6], does not arise. We shall generally
model the set $T$ of times by the set $\mathbb{R}$ or
$\mathbb{Z}$ of real or integer numbers. Elements of $T$ are
called times or instants when $T$ is considered as an affine
space, and time intervals, delays or durations when the
vectorial structure of $T$ is considered.


1.2 Events 


We consider as events the transitions between states that may
appear either in a system or in its environment, such as
setting a switch, or assigning a new value to a variable.
Moreover, an event may occur several times during the period
of observation of the system, but, as we deal with discrete
systems, the set of occurrences of an event is assumed to be
enumerable. At a suitable level of abstraction, we can decide
that an occurrence of an event has no duration, and can be
viewed as a cut in the time line, that separates the times
before and after the event occurs. Thus we define an event $e$
to be a finite or infinite increasing sequence of instants,
where $e(n)$ denotes the instant of the $n$-th occurrence of
$e$. We shall furthermore impose that, if the sequence is
infinite, it converges towards $+\infty$ in $T$ with $n$. This
restriction is motivated by algebraic reasons and may be
intuitively justified because, in discrete systems, an event
may not occur infinitely often in a finite amount of time.


The number of occurrences of an event $e$ will be noted
$\#e$. For convenience, we do not prevent an event from having
several simultaneous occurrences. The set of events will be
noted $E(T)$, or simply $E$ when the choice of $T$ is
irrelevant.


Of course, this definition of events copes with real time
behaviour modeling. However, it is also convenient for
describing sequential or purely parallel systems. For
instance, if L is a language on a vocabulary V, we can
associate with each symbol a in V and with each string c in L
the event a which is the increasing sequence of the ranks of
the symbol a in c.


This representation by means of sequences 
allows us to equally handle the present, the past 
and the future of the system. This is close to the 
point of view adopted, for instance, in the appli- 
cative language LUCID [2]. 


1.3 Counters 


An alternative way of handling events consists of using
counters. Such counters have proved useful in describing or
programming synchronization between processes [10],[11]. With
each event $e$, we shall associate a counter $\mu_e$, which is
a mapping from $T$ to $\mathbb{N}$, defined as follows:

$$\forall t \in T,\quad \mu_e(t) = \max\{\, n \mid 1 \le n \le \#e \ \&\ e(n) < t \,\}$$

(taking the maximum of the empty set to be 0). Thus $\mu_e(t)$
measures the number of occurrences of $e$ that have happened
strictly before $t$. $\mu_e$ is an increasing, left continuous
staircase function on $T$. Figure 1 pictures the counter of
the event $e=(1,3,4,6)$.

Let an event counter be an increasing, left continuous total
function from $T$ to $\mathbb{N}$, whose value is zero on some
interval $]-\infty, x_0]$. Then (using Church's lambda
notation), $\mu = \lambda e.\,\mu_e$ is obviously a bijection
between the set of events and the set of event counters,
since:

$$\forall n = 1..\#e,\quad e(n) = \max\{\, t \in T \mid \mu_e(t) < n \,\}$$
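
As a concrete check of the two formulas above, the following Python sketch evaluates the counter of the event e = (1,3,4,6) and recovers the event from its counter. It is an illustration only: the function names are hypothetical, time is assumed to be discrete (T taken as a finite window of integers), and an event is represented as a sorted list of instants.

```python
from bisect import bisect_left


def counter(e, t):
    """mu_e(t): number of occurrences of e strictly before instant t (e is a sorted list)."""
    return bisect_left(e, t)


def nth_occurrence(mu, n, instants):
    """e(n) recovered as the largest instant t in the window with mu(t) < n."""
    return max(t for t in instants if mu(t) < n)


e = [1, 3, 4, 6]
window = range(0, 8)  # assumed finite window of integer instants containing e

values = [counter(e, t) for t in window]
recovered = [nth_occurrence(lambda t: counter(e, t), n, window)
             for n in range(1, len(e) + 1)]
print(values)     # staircase values of mu_e on the window
print(recovered)  # the original sequence of instants
```

Because the count is strict, the counter steps up just after each occurrence, which is what makes it left continuous.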


2. THE ALGEBRA OF EVENTS 


A logical system behaviour will be considered
as a vector of interrelated events, and a system
as a set of such behaviours. In this section, we
shall see how to specify such a system by means of
a few operators over events.


2.1 Primary Events 


If $k \in \mathbb{N}$, the primary event $\underline{k}$ is, by
definition, the event which has exactly k occurrences,
simultaneously happening at the instant zero:


    $\forall n = 1..k,\ \underline{k}(n) = 0$
    $\mu_{\underline{k}}(t) =$ if $t \le 0$ then 0 else k

Since the instant zero will generally represent
the initial instant in a system life, primary
events will often be used to model initial states.


2.2 Ordering over E 


For every e,f in E, let


    $e \le f \iff \#e \le \#f \ \&\ \forall n = 1..\#e,\ e(n) \ge f(n)$
    $\phantom{e \le f} \iff \forall t \in T,\ \mu_e(t) \le \mu_f(t)$


So the (partial) ordering over events coincides
with the pointwise ordering over counters.
This ordering will be useful, in particular, to
represent causality relationships over events.

$(E,\le)$ is a lattice, and we can define the
inf and sup operators as follows:

    $\forall e,f \in E,$

    $\mu_{\inf(e,f)} = \lambda t.\ \min(\mu_e(t), \mu_f(t))$


    $\mu_{\sup(e,f)} = \lambda t.\ \max(\mu_e(t), \mu_f(t))$


E has a minimum element 0, which is the event
which has no occurrences (#0 = 0).


2.3 Sum and Difference of events 


The sum of two events e and f is defined to
be the event which occurs each time e or f occurs.
More precisely, the sequence of occurrences of the
event e+f is built by interleaving the sequences
of e and f, according to their temporal ordering.
This notion can be easily formalized by means of
counters, justifying the additive notation:


    $\mu_{e+f} = \lambda t.\ \mu_e(t) + \mu_f(t)$


The + operation, being obviously commutative
and associative, may be generalized to an arbitrary
finite number of operands:

    $\sum_{i=1}^{k} e_i = e_1 + e_2 + \cdots + e_k$

The product of an event e by a natural integer
k is the k times iterated sum of e:

    $ke = \sum_{i=1}^{k} e$

The difference over events is only a partial
operation, the definition of which results from
the definition of the sum:


    $d = e - f \iff e = d + f$


Note that the difference e-f is defined only
if f is a subsequence of e.


2.4 Delay Operators 


Let $\Delta$ be a delay, then the delay operator $D^{\Delta}$
performs a translation of every occurrence of its
operand according to $\Delta$:


    $\mu_{D^{\Delta} e} = \lambda t.\ \mu_e(t - \Delta)$


The exponential notation is justified by the
obvious properties that $D^0$ is the identity operator
on E, and that $D^{\Delta} D^{\delta} = D^{\Delta+\delta}$ for every delay
$\Delta$, $\delta$. The operator $D^1$ will be noted D.
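
The operators of this section can be exercised on small examples. The sketch below is only illustrative and assumes events are stored as sorted lists of instants; the helper names `add`, `delay` and `leq` are hypothetical. It lets one check that the sum adds counters, that the delay translates them, and that the ordering on sequences agrees with the pointwise ordering on counters.

```python
from bisect import bisect_left
from heapq import merge

def counter(e, t):
    """mu_e(t): number of occurrences of e strictly before t."""
    return bisect_left(e, t)

def add(e, f):
    """e + f: interleave both occurrence sequences in temporal order."""
    return list(merge(e, f))

def delay(e, d):
    """D^d e: translate every occurrence of e by the delay d."""
    return [t + d for t in e]

def leq(e, f):
    """e <= f: e has no more occurrences than f, and each one is no earlier."""
    return len(e) <= len(f) and all(a >= b for a, b in zip(e, f))

e, f = [1, 3, 4, 6], [2, 3]
s = add(e, f)          # interleaving of e and f
shifted = delay(e, 2)  # D^2 e
```

Comparing `counter(s, t)` with `counter(e, t) + counter(f, t)` over a range of instants is a quick way to see why the additive notation is justified.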


3. APPLICATION TO BEHAVIOURAL DESCRIPTION 


Let us show here that the preceding concepts 
are well suited to the description of parallel and 
real time systems, and lead to very concise 
descriptions of such systems. 


3.1 Periodic Events 


Let us express that an event e occurs at
times 0, $\Delta$, 2$\Delta$, ..., n$\Delta$, ... . Clearly, e satisfies
the following recursive definition:

    $e = D^{\Delta} e + 1$


Similarly, the weaker assumption that e
occurs at positive instants, and that two successive
occurrences of e are separated by a delay
not smaller than $\Delta$ may be expressed as follows:

    $e \le D^{\Delta} e + 1$


3.2 Response Times 


Let e be an input event to a system, and s be
the output response to e, that is requested to
occur within the time interval $\Delta$ following each
occurrence of e. This can be expressed by:


    $D^{\Delta} e \le s \le e$


These examples point out the usefulness of
linear equations and inequalities over events.
Evidence for such a fact will also be provided by
the following application of our model to the
behavioural description of finite state machines,
Petri nets, and timed Petri nets.


3.3 Finite State Machine 


Let $M = (V, Q, \sigma, q_0)$ be a finite state machine,
where:


- V is a finite vocabulary
- Q is a finite set of states
- $\sigma$ is a mapping from Q×Q to V
- $q_0 \in Q$ is the initial state

A behaviour of M is a string $c = a_1 a_2 \ldots a_n \ldots$
of V such that there exists a sequence
$q_0 q_1 \ldots q_n \ldots$ of states such that, for every n
smaller than the length of c, $\sigma(q_n, q_{n+1})$ exists
and is equal to $a_{n+1}$. In our model, a behaviour of
M will be a vector $(\hat{a} \mid a \in V)$ of events, such that
$\hat{a}$ is the sequence of the ranks of the symbol a in
a string like c.

First, we may describe, for every couple
(q,q') of Q×Q, the event "the transition q→q' is
performed". Let $e_{qq'}$ be this event. For notational
convenience, let $q^{\bullet}$ (resp. ${}^{\bullet}q$) be the set of states
q' such that $\sigma(q,q')$ (resp. $\sigma(q',q)$) is defined.


Then, by observing that a state q is left at
"instant" n if and only if it was reached at
"instant" n-1 and it has some successor state, we
get:


. For every state q such that $q^{\bullet} \ne \emptyset$,


    $\sum_{q' \in q^{\bullet}} e_{qq'} = D \sum_{q' \in {}^{\bullet}q} e_{q'q} + u(q)$


where u(q) = if $q = q_0$ then 1 else 0

Now, for every a in V, the event $\hat{a}$ happens
each "time" a transition q→q' is performed, where
$\sigma(q,q') = a$. So:


. For every a in V,


    $\hat{a} = \sum_{\sigma(q,q')=a} e_{qq'}$


Example: Let us consider the state graph of 
figure 2. We get: 


e.. te 


12 147 


“45 ~ 


+e 


and: 
“12 4 ' 
from which it follows that:


a=DC+1 and b=DC+De-C+4+D 

We shall see in section 7 a necessary and
sufficient condition for the difference in the
last equation to be defined. With this additional
condition, the above equations exactly characterize
the machine behaviours. Of course, the
characterization by means of regular expressions is much
simpler, but the same process applies to more
complex machines, like the communicating systems of
[7],[8].

Now, let us see how the model applies to a
parallel asynchronous language.


3.4 Petri Nets


Like state machines, Petri nets [9] only use
an ordering notion of time. So we shall choose T =
$\mathbb{Z}$ and describe, for each transition of the net,
the event "the transition is fired".


Notations: Let P be the set of places, T be
the set of transitions. For each place p and each
transition a, let us denote:

. $p^{\bullet}$ (resp. ${}^{\bullet}p$), the set of output (resp. input)
transitions of p.

. $a^{\bullet}$ (resp. ${}^{\bullet}a$), the set of output (resp. input)
places of a.

Let m(p,0) be the initial marking of p, and $\hat{a}$ be
the event which happens each time the transition a
is fired.


The transitions are fired one at a time, so
the marking m(p,n) of the place p at the instant n
is:

    $m(p,n) = m(p,0) + \sum_{b \in {}^{\bullet}p} \mu_{\hat{b}}(n-1) - \sum_{a \in p^{\bullet}} \mu_{\hat{a}}(n)$


Writing that this marking may not become
negative, we get:


    $\forall p \in P,\quad \sum_{a \in p^{\bullet}} \hat{a} \le \sum_{b \in {}^{\bullet}p} D\hat{b} + m(p,0)$    (1)


Now, we can write that at most one transition
may be fired at each instant:


    $\sum_{a \in T} \hat{a} \le \sum_{a \in T} D\hat{a} + 1$    (2)

(1) and (2) constitute a system of linear
inequalities which characterize the set of correct
behaviours of the net.


3.5 Timed Petri Nets


Of course, the preceding characterization of
Petri nets may be extended to synchronous real
time models such as timed Petri nets [13]. In such
nets, a delay $\Delta(p)$ is associated with each place
p. The two following rules differentiate timed
Petri nets from ordinary ones:

. If a token reaches a place p at the instant t, 
it becomes unavailable until the instant t+A(p). A 
transition is enabled if and only if each of its 
input places contains an available token. 

. A transition may not remain enabled during a non 
null interval of time: It must be either fired or 
disabled as soon as it is enabled. 


The inequality (2) of ordinary Petri nets
does not hold for timed nets, since several transitions
may be simultaneously fired. Taking the
first rule into account, the system (1) becomes:


    $\forall p \in P,\quad \sum_{a \in p^{\bullet}} \hat{a} \le D^{\Delta(p)} \sum_{b \in {}^{\bullet}p} \hat{b} + m(p,0)$


The second rule forces every event to be as
large as possible, so the above system must
become:

    $\forall a \in T,\quad \hat{a} = \inf_{p \in {}^{\bullet}a} \left( D^{\Delta(p)} \sum_{b \in {}^{\bullet}p} \hat{b} + m(p,0) - \sum_{c \in p^{\bullet} - \{a\}} \hat{c} \right)$

This system of equations characterizes the set
of correct behaviours of the net only if it does
not contain a so called "no duration loop", i.e. if
it is impossible for a token to participate in the
firing of a transition and to come back simultaneously
enabling this transition. Otherwise, the set
of correct behaviours is only a subset of the
solutions of the system of equations: For instance,
if the delays associated with both places of
the net of figure 3 are zeros, the only equation
we get is $\hat{a} = \hat{b}$, though the true behaviour is
$\hat{a} = \hat{b} = 0$, because of the null initial marking.


Figure 3


4. PSEUDO EVENTS 


In the previous section, we have illustrated
the descriptive power of the model. Let us now
look for transformation and proof techniques for
such descriptions. Starting with an equation such
as

    $e = D^{\Delta} e + 1$


the approach taken here consists of giving a sense
to the expressions:


    $(1 - D^{\Delta}) e = 1$
and
    $e = \frac{1}{1 - D^{\Delta}}$


This is achieved by extending the set of
events so as to make the difference a total operator,
and by defining an internal product.


Let us first note that, from the definitions
of the sum and delay operators over events, the
following identity holds for every event e:


    $e = \sum_{n=1}^{\#e} D^{e(n)}$


Our extension of the set of events straightforwardly
results from this identity.


4.1 Definition 


A pseudo event is a formal series


    $x = \sum_{n=1}^{\#x} x_n D^{x(n)}$

where:
- $(x_n)$ is a sequence of non null relative integers;
- $(x(n))$ is a strictly increasing sequence of
instants;
- both sequences have the same length #x, which
can be finite or infinite, but in the latter case,
the sequence $(x(n))$ converges towards infinity.


The pseudo event 0 is such that #0=0. With
each pseudo event x can be associated in a one to
one way its counter $\mu_x$ defined by:


    $\mu_x = \lambda t.$ if x=0 then 0 else $\sum_{x(n) < t} x_n$


The set R of pseudo events is provided with
the usual sum and product operators over formal
series. (R,+,×) is an integral, commutative ring
with neutral elements 0 and $1 = D^0$.


A partial order is defined over pseudo events
as follows:


    $a \le b \iff \forall t \in T,\ \mu_a(t) \le \mu_b(t)$

$(R,\le)$ is a lattice, and the sup and inf operators
are the corresponding operators on
counters.


An event is either 0 or a pseudo event with
positive coefficients $x_n$. Thus its counter is an
increasing function of t. One can see that these
definitions are consistent with the previous ones
given in sections 1 and 2, with the following
loosened notations:


Since, for every pseudo event a and every k
in $\mathbb{N}$, $\underline{k}a = ka$, we shall omit henceforth to underline
the primary pseudo events. Since 1 is the neutral
element of the product, it will be omitted in
products. So $D^{\Delta}$ will denote the event $D^{\Delta}1$. Notice
that, with these notations, the expression $D^{\Delta}a$ may
be viewed either as the $D^{\Delta}$ operator applied to a,
or as the product of $D^{\Delta}$ by a. More generally,
every pseudo event may be viewed as an operator on
R.


The product operation, and the above notations
justify the first step of the process of
formal resolution of the equation $e = D^{\Delta}e + 1$. The
second step will be justified by the study of
invertibility in R (a pseudo event a has an
inverse if and only if there exists a' such that
aa'=1).


4.2 Euclidean Division 


4.2.1 Proposition: A necessary and sufficient
condition for a pseudo event x to have an inverse
is that $x_1 = \pm 1$. Moreover,

    $\frac{1}{x} = x_1 D^{-x(1)} \sum_{n \ge 0} y^n$

where

    $y = -x_1 \sum_{n=2}^{\#x} x_n D^{x(n)-x(1)}$

and $y^n$ denotes the n times iterated product of y
by itself.


4.2.2 Corollaries: . Let a=1-e, where e is an
event such that e(1)>0, then the inverse of a is
an event, since $1/a = \sum_{n \ge 0} e^n$.


. A necessary and sufficient condition for the
inverse of an event e to be an event is that $e = D^{\Delta}$
for some $\Delta$. Obviously $\{ D^{\Delta} \mid \Delta \in T \}$ is
the set of unity elements of (E,+,×).
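
The inverse of a pseudo event can be computed term by term, like the inverse of a power series. The sketch below is a simplified illustration rather than the construction of the paper: it assumes integer instants, a first occurrence at instant 0 (so that no negative power of D is needed), and a truncation horizon; the function name `inverse` is hypothetical.

```python
def inverse(x, horizon):
    """First `horizon` coefficients of 1/x.

    x is a list of integer coefficients indexed by instant, with x[0] equal
    to +1 or -1 (the invertibility condition on the first coefficient).
    """
    if x[0] not in (1, -1):
        raise ValueError("the first coefficient must be +1 or -1")
    inv = [0] * horizon
    inv[0] = x[0]                      # 1/(+1) = +1 and 1/(-1) = -1
    for t in range(1, horizon):
        # The coefficient of D^t in x * inv must vanish for t >= 1.
        acc = sum(x[k] * inv[t - k] for k in range(1, min(t, len(x) - 1) + 1))
        inv[t] = -x[0] * acc
    return inv

geometric = inverse([1, 0, -1], 7)     # 1/(1 - D^2) = 1 + D^2 + D^4 + ...
alternating = inverse([1, 1], 4)       # 1/(1 + D)   = 1 - D + D^2 - ...
```

The first result has only positive coefficients, as expected for the inverse of 1-e with e an event; the second one does not, so it is a pseudo event but not an event.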


4.2.3 Ring norm: Let us recall that an application
$\nu$ from a ring R to $\mathbb{N}$ is called a ring norm
if and only if:

. $\nu(x) = 0 \iff x = 0$
. $\nu(xy) = \nu(x)\nu(y)$
. x has an inverse if and only if $\nu(x) = 1$

So the application $\nu$, which associates with
each pseudo event $a \ne 0$ the integer $|a_1|$, and such
that $\nu(0) = 0$, is a ring norm on R.


4.2.4 Proposition: R is a Euclidean ring,
i.e. for every a,b in R ($b \ne 0$), there exist q,r in
R such that a = bq + r and $\nu(r) < \nu(b)$.


Let us give the division algorithm, which is
very close to the division of polynomials according
to increasing powers of the variable:


. Step 0: Let $r^{(0)} = a$ and $q^{(0)} = 0$;


. Step k+1: If $|r^{(k)}_1| < |b_1|$ then stop. Else, let
$x^{(k)} = r^{(k)}_1 / b_1$. If $x^{(k)}$ is an integer, let

    $p^{(k)} = x^{(k)} D^{r^{(k)}(1) - b(1)}$,  $q^{(k+1)} = q^{(k)} + p^{(k)}$,  $r^{(k+1)} = r^{(k)} - p^{(k)} b$

and go to step k+2. Else, let $[x^{(k)}]$ be the smallest
integer greater than $x^{(k)}$ if $x^{(k)} < 0$, the greatest
integer smaller than $x^{(k)}$ otherwise. Let:

    $p = [x^{(k)}] D^{r^{(k)}(1) - b(1)}$,  $q = q^{(k)} + p$,  $r = r^{(k)} - pb$

4.3 Linear Inequalities of Pseudo Events 


Our formal calculus is now powerful enough to
solve any linear equation. However, behavioural
specification in our model makes a very general
use of linear inequalities, which are more difficult
to handle because of the partial nature of
the ordering on R. So, let us examine some properties
of this ordering in relation with algebraic
operators.


4.3.1 Inequalities and sum: For every a,b,c
in R, $a \ge b \implies a + c \ge b + c$. In other words, the
sum and difference operators are order preserving.


4.3.2 Inequalities and product: A great deal
of work concerning ordered algebraic structures
(see for instance [14]) makes the hypothesis that
the positive product is order preserving, that is to
say, that for every a,b,c:

    $a \ge b \ \&\ c \ge 0 \implies ac \ge bc$


This hypothesis is obviously false in R: For
instance 1-D is positive, but $(1-D)^2 = 1 - 2D + D^2$ is
not. So let us consider the set Mon(R) of order
preserving pseudo events:


    $Mon(R) = \{\, x \in R \mid \forall a \in R,\ a \ge 0 \implies ax \ge 0 \,\}$


It can be easily shown that Mon(R) = E.
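
The counter-example above is easy to check mechanically. The following is a minimal sketch, assuming finite pseudo events stored as dictionaries mapping an instant to its coefficient; `mul`, `counter_values` and `is_positive` are illustrative names, not notation from the paper.

```python
from collections import defaultdict
from itertools import accumulate

def mul(a, b):
    """Product of two finite pseudo events, as for formal series in D."""
    out = defaultdict(int)
    for ta, ca in a.items():
        for tb, cb in b.items():
            out[ta + tb] += ca * cb
    return {t: c for t, c in out.items() if c != 0}

def counter_values(a):
    """Values taken by the counter of a just after each of its instants."""
    return list(accumulate(a[t] for t in sorted(a)))

def is_positive(a):
    """a >= 0 in R: the counter of a never goes below zero."""
    return all(v >= 0 for v in counter_values(a))

one_minus_d = {0: 1, 1: -1}            # 1 - D
square = mul(one_minus_d, one_minus_d) # 1 - 2D + D^2
event = {0: 1, 3: 2}                   # an event: positive coefficients only
```

Multiplying the positive pseudo event 1-D by the event `event` keeps it positive, whereas multiplying it by itself does not, which illustrates why only events preserve the order.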
Example: Let us consider the two inequalities:


    $x(1 - D^{\Delta}) \le 1$    (1)

    $x \le \frac{1}{1 - D^{\Delta}}$    (2)

(1) means that x cannot have two occurrences
separated by a delay smaller than $\Delta$ (cf. 3.1).
Since $1/(1-D^{\Delta})$ is an event, we may multiply by it
the two members of (1), so (1) implies (2). But
the converse is false, because $1-D^{\Delta}$ is not an
event: Figure 4 pictures an event satisfying (2)
but not (1).


Figure 4


5. APPLICATION TO DESIGN PROBLEMS 


In this section, we shall illustrate the use 
of the calculus on pseudo events on two simple 
problems. 


5.1 First Example 


A system receives two strictly periodic
sequences of input requests. The former sequence
starts from the instant 0, with a 2 seconds
period, and the latter one starts from the instant
1, with a 4 seconds period. The system is made of
n identical processors, each of which takes 7
seconds for processing a request belonging to the
former sequence, and 5 seconds for processing a
request from the latter one. This system may be
represented by the timed Petri net of Figure 5.


Figure 5


The question is: What is the minimum number
of processors needed so as to take into account
every request as soon as it happens?


In our model, this problem may be stated as
follows: Let $\hat{a}$, $\hat{b}$ be the events respectively
associated with input arrivals from each sequence. Let
$\hat{c}$, $\hat{d}$ respectively represent the event "an input
from the former (resp. latter) sequence is taken
into account by some processor", and $\hat{e}$, $\hat{f}$
respectively represent the event "a processor ends
processing an input from the former (resp. latter)
sequence". Then:


. The specification of input sequences may be
written:
    $\hat{a} = D^2\hat{a} + 1$ and $\hat{b} = D^4\hat{b} + D$    (1)

. Since a request cannot be taken into account
before its arrival, we have:
    $\hat{c} \le \hat{a}$ and $\hat{d} \le \hat{b}$    (2)

. The processing times of requests are specified
as follows:
    $\hat{e} = D^7\hat{c}$ and $\hat{f} = D^5\hat{d}$    (3)

. As a request may only be taken into account when
there exists an idle processor, we get:
    $\hat{c} + \hat{d} \le \hat{e} + \hat{f} + n$    (4)

. Finally, the immediate handling requirement
provides:
    $\hat{c} = \hat{a}$ and $\hat{d} = \hat{b}$    (5)


Now, (1) reduces to

    $\hat{a} = \frac{1}{1-D^2}$ and $\hat{b} = \frac{D}{1-D^4}$


So getting rid of any event variable, the problem
may be restated as follows:


"Find the least integer n, such that

    $\frac{1-D^7}{1-D^2} + \frac{D(1-D^5)}{1-D^4} \le n$"


or "what is the maximum value of the counter of
the pseudo event

    $x = \frac{1 + D + D^2 - D^6 - D^7 - D^9}{1 - D^4}$ ?"

Now, we can perform the division in x, until
getting:

    $x = 1 + D + D^2 + D^4 + D^5 - \frac{D^7(1-D)}{1-D^4}$


$-D^7(1-D)/(1-D^4)$ is a periodic pseudo event, the
counter of which can easily be shown to have the
maximum value 0. Thus, the maximum value of the
counter of x is the one of $1+D+D^2+D^4+D^5$, which is
5 (see figure 6). So n=5 is the solution.
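
The value n=5 can be cross-checked numerically. The sketch below is only an illustration under stated assumptions: integer instants, a truncation horizon of 60 seconds, and immediate handling of every request. It expands the series x coefficient by coefficient and compares its running sum with a direct count of busy processors; all names are illustrative.

```python
from itertools import accumulate

HORIZON = 60  # assumed long enough to cover several periods

# Numerator 1 + D + D^2 - D^6 - D^7 - D^9, as {instant: coefficient}.
numerator = {0: 1, 1: 1, 2: 1, 6: -1, 7: -1, 9: -1}

# Dividing by (1 - D^4) means x = numerator + D^4 x,
# i.e. x[t] = numerator[t] + x[t - 4].
x = [0] * HORIZON
for t in range(HORIZON):
    x[t] = numerator.get(t, 0) + (x[t - 4] if t >= 4 else 0)

# Running sum of the coefficients: number of busy processors just after t.
busy_from_series = list(accumulate(x))

def busy_by_simulation(t):
    """Requests already arrived at t whose processing is not yet finished."""
    first = sum(1 for s in range(0, t + 1, 2) if t < s + 7)   # period 2, 7 s each
    second = sum(1 for s in range(1, t + 1, 4) if t < s + 5)  # period 4, 5 s each
    return first + second

busy_simulated = [busy_by_simulation(t) for t in range(HORIZON)]
min_processors = max(busy_from_series)
```

Both computations give the same occupancy profile over the horizon, and its maximum is the number of processors required.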


Figure 6 (counter $\mu_x(t)$ plotted against t)


5.2 Second Example 


Let us consider two processes $p_1$ and $p_2$
sharing an exclusive resource. Each process i
cyclically asks for the resource, uses it during a
delay $\delta_i$, then releases the resource and works
during a delay $\Delta_i$ ($\delta_i, \Delta_i > 0$), after which it
comes back asking for the resource. This system is
represented by the net of Figure 7.

Now assume that the resource is very expensive
and is required to be permanently used. The
problem is: What condition must the delays
$\delta_1$, $\Delta_1$, $\delta_2$, $\Delta_2$ satisfy to achieve this requirement?


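With the notations of the net, the problem
may be stated as follows: Find a necessary and
sufficient condition on the delays $\delta_i$, $\Delta_i$ so that
the following system S admits a solution
$(e_1, e_2)$ in E×E:


    $e_i (1 - D^{\delta_i+\Delta_i}) \le 1,\quad i = 1,2$
    $e_1 (1 - D^{\delta_1}) + e_2 (1 - D^{\delta_2}) = 1$

Now, since $e_1(1-D^{\delta_1}) + e_2(1-D^{\delta_2}) = 1$, which
is an event, then $D^{\delta_1}e_1 + D^{\delta_2}e_2$ must be a
subsequence of $e_1 + e_2$. On the other hand, since
$\Delta_1 > 0$ and $e_1(1-D^{\delta_1+\Delta_1}) \le 1$, $D^{\delta_1}e_1$ has no
simultaneous occurrences with $e_1$, for one can easily show
that for every integer n:

    $e_1(n+1) > (D^{\delta_1}e_1)(n) > e_1(n)$


So $D^{\delta_1}e_1$ (respectively $D^{\delta_2}e_2$) must be a
subsequence of $e_2$ (resp. $e_1$).


$e_2 - D^{\delta_1}e_1$ and $e_1 - D^{\delta_2}e_2$ are events and their
sum equals 1, so one of them must be equal to 1
and the other to 0. Therefore:


    $S \iff S_{1,2}$ or $S_{2,1}$, where

    $S_{i,j}$:  $e_i = \frac{1}{1-D^{\delta_i+\delta_j}}$,  $e_j = \frac{D^{\delta_i}}{1-D^{\delta_i+\delta_j}}$,
               $\frac{1-D^{\delta_i+\Delta_i}}{1-D^{\delta_i+\delta_j}} \le 1$,  $\frac{D^{\delta_i}(1-D^{\delta_j+\Delta_j})}{1-D^{\delta_i+\delta_j}} \le 1$

So, a solution satisfying S exists if and
only if

    $\frac{1-D^{\delta_1+\Delta_1}}{1-D^{\delta_1+\delta_2}} \le 1$ and $\frac{1-D^{\delta_2+\Delta_2}}{1-D^{\delta_1+\delta_2}} \le 1$


which is equivalent to

    $\delta_1+\Delta_1 \le \delta_1+\delta_2$ and $\delta_2+\Delta_2 \le \delta_1+\delta_2$

The final necessary and sufficient condition is


    $\Delta_1 \le \delta_2$ and $\Delta_2 \le \delta_1$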


6. APPROXIMATE ANALYSIS USING DISCRETE TRANSFORM


In section 5, we have given some examples of
the use of the formal calculus in proving properties
about behavioural specifications. Of course
the proofs performed there may have appeared
rather ad hoc, and are not amenable to systematization.
On the other hand, it has been shown in
§4.3 that the non monotonicity of the product
over pseudo events may give rise to difficult
problems in dealing with linear inequalities. In
this section, we shall propose a systematic method
providing approximate results, even when such
difficulties arise.


Our definition of pseudo events by means of
formal series of the delay operator D is very
close to discrete transform techniques widely used
in the field of finite difference equations.
Nevertheless, to our knowledge, those techniques
have never been applied to inequalities.


6.1 Definition: For every pseudo event
$a = \sum_{n=1}^{\#a} a_n D^{a(n)}$, let us define the function $\varphi_a$,
from $\mathbb{R}^+$ to $\mathbb{R}$, by:


    $\varphi_a = \lambda x.\ \sum_{n=1}^{\#a} a_n x^{a(n)}$

$\varphi_a$ is generally a partial function, only defined
on an interval $[0, r_a[$, where $r_a$ is the convergence
radius of the series.


6.2 Theorem: If a is a positive pseudo event,
then $\varphi_a$ is positive on the interval $]0, \min(1, r_a)[$.
The converse is not true.


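6.3 Example of application: Let us come back
to example 5.2. We want the system S to have a
solution, where


    $e_i(1-D^{\delta_i+\Delta_i}) \le 1,\quad i = 1,2$

    $e_1(1-D^{\delta_1}) + e_2(1-D^{\delta_2}) = 1$


Eliminating $e_2$, we get:


    $e_1(1-D^{\delta_1+\Delta_1}) \le 1$

    $\frac{\left(1 - e_1(1-D^{\delta_1})\right)(1-D^{\delta_2+\Delta_2})}{1-D^{\delta_2}} \le 1$


Now this system admits a solution $e_1$ only if
there exists a real function $\varphi$ ($= \varphi_{e_1}$) such that,
for every x in [0,1[:


    $\varphi(x)\,(1-x^{\delta_1+\Delta_1}) \le 1$

    $x^{\delta_2}(1-x^{\Delta_2}) \le \varphi(x)\,(1-x^{\delta_1})(1-x^{\delta_2+\Delta_2})$


which is equivalent to:

    $\forall x \in [0,1[,\quad F(x) \ge 0$

where

    $F(x) = \frac{1}{1-x^{\delta_1+\Delta_1}} - \frac{x^{\delta_2}(1-x^{\Delta_2})}{(1-x^{\delta_1})(1-x^{\delta_2+\Delta_2})}$


In the neighbourhood of x=1,

    $F(x) \sim \frac{\delta_1\delta_2 - \Delta_1\Delta_2}{\delta_1(\delta_1+\Delta_1)(\delta_2+\Delta_2)(1-x)}$

So a necessary condition for the system S to have
a solution is:


    $\Delta_1\Delta_2 \le \delta_1\delta_2$

It is exactly the result provided by the
method of [12] to find permanent behaviours of
timed Petri nets. Notice that it is only a necessary
condition, since the n.s.c. found in 5.2 was:


    $\Delta_1 \le \delta_2$ and $\Delta_2 \le \delta_1$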
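
The gap between the two conditions can be made concrete with a small numerical check. The sketch below assumes the conditions read as "the product of the working delays is at most the product of the usage delays" (necessary) and "each working delay is at most the other process's usage delay" (necessary and sufficient); the parameter names `use1`, `work1`, `use2`, `work2` stand for the usage and working delays of the two processes and are chosen here for illustration.

```python
def necessary_condition(use1, work1, use2, work2):
    """Condition obtained through the discrete transform (necessary only)."""
    return work1 * work2 <= use1 * use2

def exact_condition(use1, work1, use2, work2):
    """Necessary and sufficient condition obtained in 5.2."""
    return work1 <= use2 and work2 <= use1

# A choice of delays accepted by the approximate analysis but not by the
# exact one: process 2 works for 2 time units while process 1 only holds
# the resource for 1, so the resource is left idle.
gap_example = dict(use1=1, work1=2, use2=4, work2=2)

# Over a small grid of positive delays, the exact condition always implies
# the necessary one.
grid = range(1, 6)
implication_holds = all(
    necessary_condition(a, b, c, d)
    for a in grid for b in grid for c in grid for d in grid
    if exact_condition(a, b, c, d)
)
```

The example shows that the transform method may accept parameter values that the exact analysis rejects, while never rejecting a valid one on the grid explored.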


7. DISCRETE TIME


All the non real time, and most of the real
time digital systems make use of a discrete notion
of time. This motivates the investigation of
particular properties of $R(\mathbb{Z})$ which is done in
this section.


7.1 Discrete Derivatives 


7.1.1 Definition: If $a \in R(\mathbb{Z})$, let us call the
derivative of a the pseudo event a(1-D).


This denomination is motivated by the following
(obvious, but very useful) proposition,
which corresponds to the property of real
functions, that a function is increasing if and
only if its derivative is positive:


7.1.2 Proposition: A necessary and sufficient
condition for a pseudo event a in $R(\mathbb{Z})$ to be an
event, is that its derivative is a positive pseudo
event.
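
This proposition can be illustrated on finite pseudo events. The sketch below assumes integer instants starting at 0 and a pseudo event stored as a list of coefficients indexed by instant; the function names are illustrative only.

```python
from itertools import accumulate

def derivative(a):
    """Coefficients of a(1 - D): each coefficient minus the previous one."""
    padded = a + [0]                   # the product has one more instant
    return [padded[t] - (padded[t - 1] if t > 0 else 0)
            for t in range(len(padded))]

def is_positive(a):
    """a >= 0 in R: the running sum of the coefficients never goes below 0."""
    return all(v >= 0 for v in accumulate(a))

def is_event(a):
    """An event has only positive (non negative) coefficients."""
    return all(c >= 0 for c in a)

event = [1, 0, 2]        # one occurrence at 0, two at 2
not_event = [1, -1, 1]   # positive as a pseudo event, but not an event
```

Since the running sum of the derivative gives back the coefficients of a, testing the positivity of the derivative amounts to testing the sign of every coefficient, which is the content of the proposition.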


Example: Let us come back to the example
given in 3.3. As announced there, we are now able
to express the condition on $\hat{c}$ for $D^2\hat{c} + D\hat{c} - \hat{c} + D$
to be an event, which is:


    $D(1-D) \ge \hat{c}\,(1-D-D^2)(1-D)$


7.2 Linear Inequalities and Fixed Points


Notations: For each a,b in $R(\mathbb{Z})$, let us
define:
. $[a) = \{\, x \in R(\mathbb{Z}) \mid a \le x \,\}$
. $(b] = \{\, x \in R(\mathbb{Z}) \mid x \le b \,\}$
. $[a,b] = [a) \cap (b]$


7.2.1 Proposition: For each a,b in $R(\mathbb{Z})$, [a)
(respectively (b], [a,b]) is a complete inf-closed
semilattice (resp. sup-closed semilattice,
lattice), i.e. every subset of [a) (resp. (b],
[a,b]) has a greatest lower bound (resp. a least
upper bound, a least upper bound and a greatest
lower bound).


Notice that $R(\mathbb{R})$ does not satisfy this
property: For instance, the sequence:


    $\left( x_n = D^{\frac{2n-2}{2n-1}} - D^{\frac{2n-1}{2n}} \;\middle|\; n \in \mathbb{N} \right)$


is included in [0, 1-D], but has no least
upper bound in $R(\mathbb{R})$.


7.2.2 Proposition: Let us recall that a
function f from R to R is said to be
lattice continuous, if and only if, for every
subset X of R admitting a least upper bound
$\bar{x}$ (resp. a greatest lower bound $\underline{x}$) the
set $\{ f(x) \mid x \in X \}$ admits a least upper
bound y such that $y = f(\bar{x})$ (resp. a greatest lower
bound y such that $y = f(\underline{x})$).


Then, for every $\Delta$ in T and every pair (f,g)
of lattice continuous functions, the functions
$\lambda x. D^{\Delta}x$, $\lambda x. f(x)+g(x)$, $\lambda x. \inf(f(x),g(x))$,
$\lambda x. \sup(f(x),g(x))$ are lattice continuous.


7.2.3 Application: Let us consider a system of
linear inequalities in $R(\mathbb{Z})$, of the following
form:

    $S = \{\, x(1-e_i) \le b_i,\ i = 1..n \,\}$

where all the $e_i$ are events such that $e_i(1) > 0$.


Then the set P of solutions of S is the set
of pre-fixed points of the function
$f_S = \lambda x.\ \inf_{i=1..n}(b_i + e_i x)$, which is lattice
continuous.


On the other hand, from 4.2.2 and 4.3.2, we
get:

    $x(1-e_i) \le b_i \implies x \le b_i/(1-e_i)$

So P is included in $(\beta_0]$, with
$\beta_0 = \inf_{i=1..n}(b_i/(1-e_i))$. Since $(\beta_0]$ is a sup-closed
semilattice, if P is not empty, it admits a least
upper bound $\beta$. By Tarski's fixed point theorem,
$\beta$ is the greatest fixed point of $f_S$. Furthermore,
the sequence $(f_S^i(\beta_0) \mid i \in \mathbb{N})$ is included in the
complete lattice $[\beta, \beta_0]$, and by Kleene's fixed
point theorem, it converges towards $\beta$. Note that
P is generally only included in $(\beta]$. The point is
that by this process, we can add to S a new
inequality, which is implied by S and may be
saturated, since $\beta \in P$.


Example:
Let us consider the following system of
inequalities:


    $x \le \frac{1}{1-D^3}$

    $x \le 1 - \frac{D^4}{1+D} + Dx$


Neither of the two inequalities may be saturated
by x without violating the other. But the
system reduces to:


    $x \le \frac{1}{1-D^3},\quad x \le f(x)$

with $f = \lambda x.\ \inf\left(\frac{1}{1-D^3},\ 1 - \frac{D^4}{1+D} + Dx\right)$.

Using the above notations, we get:


    $\beta_0 = \inf\left(\frac{1}{1-D^3},\ \frac{1}{1-D} - \frac{D^4}{1-D^2}\right) = \frac{1}{1-D^3}$

Let us compute the greatest fixed point $\beta$ of f,
smaller than $\beta_0$. We get:


    $\beta_0 = f^0(\beta_0) = 1/(1-D^3)$

    $\beta_1 = f^1(\beta_0)$
           $= \inf(1/(1-D^3),\ 1 - D^4/(1+D) + D/(1-D^3))$
           $= 1 + (D^3+D^7)/(1-D^6)$    (see Figure 8)


    $\beta_2 = f^2(\beta_0)$
           $= \inf(1/(1-D^3),\ 1 - D^4/(1+D) + D + (D^4+D^8)/(1-D^6))$
           $= \beta_1$

So,

    $\beta = 1 + \frac{D^3+D^7}{1-D^6}$

and the initial system implies:

    $x \le 1 + \frac{D^3+D^7}{1-D^6}$

Figure 8 (counters of $1/(1-D^3)$ and of $1 - D^4/(1+D) + D/(1-D^3)$, plotted against t)


As a last illustration of the descriptive
power of our calculus, let us consider the
description of a task that needs a delay $\Delta \in \mathbb{N}$, but
may be interrupted on every integer instant. The
task is assumed to be non reentrant.


Figure 9 


Modeling this task leads to a very complex
timed Petri net (Figure 9). In this net, the transition
a represents the beginning of the task.
When the token reaches a place $J_i$, the task may be
either immediately interrupted by the firing of
$c_i$, then entering the interrupted state $I_i$ until
reactivated by the firing of $d_i$, or else continued
for one unit of time in $W_i$ before becoming again
interruptible. b ($= b_{\Delta}$) represents the end of the
task.


Proposition: Let $\hat{a}$, $\hat{b}$, $\hat{c}$, $\hat{d}$ be the four events
respectively representing the beginning, the end,
the interruption and the reactivation of the task.
Then, given $\hat{a}$, $\hat{c}$, $\hat{d}$, the event $\hat{b}$ is uniquely
determined by the following relation:


The proof is rather tedious [5], but completely
formal, and the result proved is not trivial
and may be used to deal with systems with
interruptible tasks in a much simpler way than by means
of timed Petri nets.


CONCLUSION 


This paper has presented a model for real
time and parallel systems, and a set of results
allowing, to some extent, the transformation and
analysis of the description of these systems in
the model. This work must be extended particularly
in two directions:

First, the power of the calculus must be
increased. We have shown that a great deal of
problems involve investigations on systems of
linear inequalities. For instance, let us consider
two communicating asynchronous processes like in
CCS [7]. Assume each process may be described by a
system of linear inequalities over its external
events. Then the resulting process will be described
by the conjunction of the two systems, where
the interprocess communication events have been
equalized and eliminated. So we must be able to
eliminate a variable from a system of linear
inequalities without losing any information about
the remaining variables. Furthermore, many
problems, and particularly scheduling problems,
may be expressed by linear optimization problems
over (pseudo) events. But the partial nature of
the ordering relationship gives rise to a lot of
difficult questions in applying linear programming
techniques.


Another future extension concerns numerical
systems. One way is to combine the results obtained
by our calculus with classical techniques of
program analysis. Another possibility is to extend
the model to deal with variables. This was done in
[4] for specification purposes, but the extension
of the calculus to such a widened model is far from
obvious.


In spite of these questions, the model
presented here seems to us a powerful tool to
describe and analyse the behaviour of parallel and
real time systems, and a unifying framework for a
lot of problems in this field. Of course this
approach is not considered as a competitor to the
classical state-transition ones, but is expected
to lead to complementary results.


RESOURCE EXPRESSIONS FOR APPLICATIVE LANGUAGES


Abstract -- A high-level approach to resource
management in the framework of an applicative
language is presented. A resource is defined as a
linguistic construct that may be used either to
exercise control over the concurrent evaluation of
functions, or to serve as an interface to files,
databases, etc. The specification of this control is
achieved by resource expressions. Resource
expressions are closely related to path expressions in
their basic approach to specification of constraints,
but differ in their semantics and implementation. The
semantics of resource expressions is based on the
concept of execution graphs and residues, and an
implementation has been constructed using a set of
queueing primitives for a demand-driven execution
model.


1. INTRODUCTION 


The results presented here are motivated by a 
desire to develop resource management primitives 
which mesh well with an applicative programming 
language. In retrospect, the language constructs to be 
described also work well for ordinary languages, but 
there are other options for those languages which are 
not attractive for applicative languages. 


The major advantage of an applicative language for
distributed systems is that no special care need be
taken in exploiting available concurrency; the results
of a program are guaranteed to be well-defined,
independent of system timings. However, the concept
of "resource management" for such a language may
still be relevant, on two grounds:


1. It may be necessary to control the amount of
concurrency which would occur naturally within
the execution of a program, as concurrent
evaluation of functions requires additional
resources (e.g. memory) to support.


2. It may be desirable to augment an applicative
language with constructs to enable efficient
interfacing with files and databases, i.e.,
structures that may change because of side-effects.
Techniques are then needed to
encapsulate such side-effects and make them
interface cleanly with "pure" applicative code.


After some experimentation with various
encapsulation methods for resource control
(resembling monitors, serializers, etc. [3, 13, 14]), it
was decided that such methods do not fit well within
the framework of applicative languages, as they require
the introduction of operational notions such as queues,
messages, etc., and also do not lend themselves to a
convenient denotational semantics. On the other hand,
expression-based languages, such as path-expressions
and their variants [1, 6, 7, 12] are attractive for three
reasons:


1. They may be composed from primitive constructs
similar to the way functional expressions are
composed, and hence are compatible with the
applicative style of programming.


2. They possess "bracketing" qualities similar to
function evaluation, i.e., it is not necessary for the
programmer to indicate explicitly the start and
stop of an action. Rather, these events are
contained in the notion of a functional expression
being evaluated as a unit.


3. They have been derived from the notion of regular
expressions [21] which can be considered as a
denotational description of finite automata,
suggesting that their extensions to resource
control (which require transcending the finite-state
languages) might also be amenable to a
denotational semantics.


We describe here an expression-based language 
extension called resource expressions. Its uses, 
semantics, and implementation are the subjects of this 
paper. Resource expressions are closely related to 
path expressions in their basic approach _ to 
specification of constraints, but differ in their 
semantics and implementation. 


There have been some attempts to introduce the
concept of a resource in an applicative framework:
Arvind et al. [2] present dataflow monitors as a means
for defining a resource and its scheduling, and Gurd
and Catto [11] present some implementation ideas for
dataflow monitors. In comparison, resource
expressions are a higher level means of specifying
resource control, since certain types of scheduling
disciplines are expressed more succinctly in resource
expressions. However, the expressiveness of resource
expressions in their current form is more limited
compared to dataflow monitors.


Friedman and Wise [10] introduce an 
indeterminate operator frons for constructing a 
multiset, the order of whose elements is determined 
only when the multiset is accessed. Although frons 
may be used to express solutions to a variety of 
problems requiring the use of indeterminate merging, 
the issues of resource control are not handled at the 
level at which frons is used. 


Another type of approach, the use of pseudo- 
functions [16], is attractive, but is less structured than 
the one presented here. However, pseudo-function 
constructs are employed in the implementation of our 
current model. 


2. LINKING RESOURCES TO FUNCTIONAL EXPRESSIONS 


Programs in applicative languages are presented 
as expressions denoting the application of functions to 
their arguments. We refer to these as functional 
expressions to distinguish them from resource 
expressions, which form the main topic of this paper. 


Suppose that we desired to exert greater control 
over the evaluation of various functions. We could use 
some device which explicitly sequences those functions 
[14, 17]. However, if the functions are evaluated in 
unpredictable order or are embedded within very large 
expressions, it is then desirable not to impose a rigid 
sequencing, but rather to impose a system of 
constraints on functions, e.g. that certain 
subexpressions do not get evaluated concurrently, etc. 
This has the effect of giving greater freedom on the 
order of expression evaluation, in the case where it is 
difficult to determine a priori orderings giving the right 
amount of concurrency. These constraints are 
expressed herein as resource expressions, and we 
think of them as providing a kind of "synchronizing 
overlay" on a functional expression. 


To indicate the invocation of a function f which is 
controlled by a resource, we use res(’f, args), in place 
of the usual f(args). Here res is a pseudo-functional 
object which represents an instance of the resource, 
and is created by evaluating a pseudo-function 
specifying the actual resource. Since the function 
being evaluated is encapsulated inside the resource, 
the quoted f is used to avoid lexical scoping violations. 
A variable denoting the quoted f could be used instead. 
The definition of the actual resource takes the form 


RESOURCE ...resource name... parameters... 
CONSTRAINT 
...resource expression... 
WHERE 
ACCESS ...function definition... 
(with optional IMPORTS) 
ACCESS ...function definition... 
(with optional IMPORTS) 
END 


where ACCESS is used to identify functions that may 
be invoked from outside the resource. The current 
version of the language extensions also allows nested 
resource definitions, but we will not be concerned with 
them in this paper. 


Once a resource is instantiated, it may then be 
accessed by so-called tokens. A token is a request (or 
demand) to evaluate some function controlled by the 
resource, along with the actual parameters, if any, 
needed for this evaluation. The term token class will 
be used to refer to the function that is controlled by 
the resource. 


We now sketch two examples illustrating different 
uses of resource expressions: the first illustrates how 
to interface with a database, and the second, how to 
control the amount of concurrency arising in 
concurrent evaluation of functions. 


A skeletal example of a resource manager that 
encapsulates a shared database accessed by "read" 
and "write" operations is 

RESOURCE database_manager(database) 
CONSTRAINT 
(write*+[read])# 
WHERE 
ACCESS write(...) IMPORTS database 
ACCESS read(...) IMPORTS database 
END 
The resource expression here enforces the well-known 
readers-and-writers constraint [8]. The subexpression 
write* allows a sequence of arbitrarily many write’s and 
the subexpression [read] allows in parallel arbitrarily 
many read’s. Since "+" denotes nondeterministic 
selection and "#" denotes non-terminating sequential 
repetition, it follows that read’s and write’s always 
exclude one another. If db1 and db2 are two distinct 
databases, then identical but independent managers 
for each could be created by equations 

LET res1 = database_manager(db1), 

res2 = database_manager(db2) 

The two databases are synchronized independently 
using functional expressions such as res1(’read), 
res2(’write, val), etc. 
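
As a rough illustration of what the constraint (write*+[read])# permits, the following Python sketch checks a trace of start/end events against the readers-and-writers rule. The event encoding and the function name are assumptions made for this illustration; they are not part of the language extension, and the sketch checks only the exclusion property, not the full semantics of the expression.

```python
def respects_readers_writers(events):
    """Check a trace of ("start" | "end", "read" | "write") events.

    The trace is accepted when writes run one at a time and never
    overlap with reads, while reads may overlap one another.
    """
    active_readers = 0
    writer_active = False
    for kind, op in events:
        if kind == "start" and op == "write":
            # A write may start only when nothing else is running.
            if writer_active or active_readers:
                return False
            writer_active = True
        elif kind == "start" and op == "read":
            # A read may start alongside other reads, never during a write.
            if writer_active:
                return False
            active_readers += 1
        elif kind == "end" and op == "write":
            writer_active = False
        elif kind == "end" and op == "read":
            active_readers -= 1
        else:
            raise ValueError(f"unknown event: {kind!r}, {op!r}")
    return True


overlapping_reads = [("start", "read"), ("start", "read"),
                     ("end", "read"), ("end", "read")]
read_during_write = [("start", "write"), ("start", "read"),
                     ("end", "read"), ("end", "write")]
print(respects_readers_writers(overlapping_reads))
print(respects_readers_writers(read_during_write))
```

The first trace is accepted because only reads overlap; the second is rejected because a read starts while a write is in progress.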


As a second example, consider the concurrent 
computation evoked by the following function 
definitions: 

FUNCTION main(x) 
RESULT IF x=0 THEN 0 ELSE 
f(x)*g(x)/h(x) + main(x-1) 
WHERE FUNCTION f(x)... 
FUNCTION g(x) ... 
FUNCTION h(x) ... 
END 
Suppose we wished to constrain the evaluation of f and 
g (but not h) so that only one of them is evaluated at 
any given time; in other words, f and g must be 
executed in mutual exclusion of one another. We may 
express this constraint as follows: 
FUNCTION main(x) 
LET res=mutex() 
RESULT IF x=0 THEN 0 ELSE 
res(’f, x)*res(’g, x)/h(x) 
+ main(x-1) 
WHERE RESOURCE mutex() 
CONSTRAINT (f+g)# 
WHERE ACCESS f(x) ... 
ACCESS g(x) ... 
END 
FUNCTION h(x) ... 
END 


If we wished to constrain h also, and further 
wished to allow arbitrarily many h’s to follow f or g, we 
could use the expression ((f+g).h* )#, where "." denotes 
sequencing. (The definition of h would now also have to 
be encapsulated inside mutex.) 


If, in addition to the constraints of the preceding 
example, we were willing to allow arbitrarily many h’s 
to proceed concurrently with themselves, we would use 


((f+g).[h])#. 


To summarize the available constructs, we present the 
syntax of resource expressions accompanied by a brief 
informal semantics. Each individual token class is a 
resource expression. Furthermore, if R and S are 
resource expressions, then so are 

R+S   denoting the non-deterministic choice of either 
      R or S as alternatives; the alternative chosen 
      must be "satisfied" by the availability of tokens. 

R.S   denoting the sequencing of R followed by S only 
      when there are sufficient tokens to satisfy both 
      R and S. 

R*    denoting a non-deterministic choice of an 
      arbitrary number of sequential repetitions of R; 
      the number of repetitions depends on the 
      number of available tokens. 

R#    similar to R*, except no non-deterministic 
      choice is involved; # does not terminate. 

[R]   similar to R*, except that consecutive 
      repetitions may be done in parallel. 

{R}   similar to R#, except that consecutive 
      repetitions may be done in parallel. 

It should be mentioned that the number of 
repetitions in the above repetitive expressions, i.e. R*, 
R#, {R}, and [R], includes zero. Hence the number of 
tokens needed to satisfy such expressions is zero. 
Also, the meaning of our sequencing operator "." is 
different from the sequencing operator ";" used in path 
expressions. We elaborate on this distinction in the 
next section. 


3. A BASIS FOR FORMAL SEMANTICS 


Our motivation for formalizing the meaning of 
resource expressions is to provide a precise 
specification not only for the user, but also the 
implementor. Most attempts at giving a semantics for 
expression-based languages have been informal [6, 7], 
operational [1, 19], or formal-language based [4, 23, 
20]. Of these the formal-language semantics is of 
interest here, since it comes closest to an acceptable 
denotational definition for resource expressions. 


Semantics based on formal languages define the 
meaning of an expression to be a set of allowable 
execution sequences of tokens, derived solely from the 
expression. In general, one considers a set of partial 
orders on tokens, rather than sequences, to account 
for concurrent execution. However, for sake of 
simplicity of presentation, we will use sequences 
instead of partial orders in the subsequent discussions 
in this section. 


Such a semantics implicitly assumes that one is 
only interested in "consistent" behaviors, i.e. the 
sequence of all tokens allowed to execute must be a 
prefix of some member of the above set. However, the 
notion of "completeness" is stronger, i.e., any sequence 
of tokens allowed to execute must be exactly equal to 
some member of the above set. We will refer to 
such a sequence as a complete sequence. 


The notions of consistency and completeness are 
expressed by the following two semantic models: 


1. Consistency can be realized by an expedient 
approach, which chooses any alternative of 
the expression which is partially satisfied by 
the available tokens. 


2. Completeness, on the other hand, can only be 
realized by a prudent approach, which 
chooses an alternative of the expression that 
is completely satisfied by the available 
tokens. 


As an example, consider the expression (a.b + 
c.a). Assuming that an "a" and a "c" token are 
available, an expedient approach may choose the 
sub-expression "a.b", even though no "b" token had arrived. 
The "c" token will then not be executed. A prudent 
approach would prefer the sub-expression "c.a", since 
it will permit both "a" and "c" to be executed. With the 
expedient approach, it is possible to get blocked after 
"a", since a "b" token may never arrive. 
Thus, although the expedient approach is more 
efficient, since it provides faster response to certain 
tokens, it may fail to execute complete sequences, 
even when there are sufficient tokens. A prudent 
approach, on the other hand, generally would take 
longer to decide what to do with a given collection of 
tokens, but offers the advantage of always being able to 
execute complete sequences. 
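
The contrast can be sketched in Python for an expression that is a flat set of alternative sequences. This is only an illustration under stated assumptions: the function names are hypothetical, and "partially satisfied" is read here as "a token for the first term of the sequence is available".

```python
from collections import Counter


def expedient_candidates(alternatives, tokens):
    """Alternatives whose first term is covered by an available token."""
    bag = Counter(tokens)
    return [alt for alt in alternatives if bag[alt[0]] > 0]


def prudent_candidates(alternatives, tokens):
    """Alternatives whose every term is covered by the available tokens."""
    bag = Counter(tokens)
    return [alt for alt in alternatives
            if all(bag[tok] >= n for tok, n in Counter(alt).items())]


# The expression (a.b + c.a) with an "a" and a "c" token available.
alternatives = [("a", "b"), ("c", "a")]
tokens = ["a", "c"]
print(expedient_candidates(alternatives, tokens))
print(prudent_candidates(alternatives, tokens))
```

Under these assumptions the expedient chooser may pick either alternative, including a.b, which can block waiting for "b"; the prudent chooser admits only c.a.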


We use a prudent approach for resource 
expressions consisting of a single sequence, e.g. a.b.a.c 
would require two a’s, one b, and one c token to be 
present before it is chosen as an alternative. To 
provide the efficiency of the expedient approach, we 
have introduced the construct "/" for commit which 
can be used in place of a "." in a sequence. The 
meaning here is that only enough tokens to enable 
execution of the prefix up to the "/" are required for 
committing to the entire sequence. Thus, in a.b/a.c 
one "a" and one "b" would suffice, and the subsequent 
"a" and "c" would be processed when they arrive, but 
would not hold up the first "a" and "b" for their arrival. 
In this way we give the user the capability of choosing 
"shades" of expedience and prudence. 


Our semantics will be as if there were an implicit 
commit after the body of each of the repetitive 
expressions, as well as one at the end of a top-level 
expression. 


It should be noted that a/b is not equivalent to 
a+a.b, and hence the commit construct can’t be 
simulated using simply sequence and alternation. To 
see this, compare the behavior of these two 
expressions on the input tokens {a,b}. The expression 
a/b will allow both a and b, whereas the expression 
a+a.b will allow one of two possible outcomes due to 
the non-determinism of "+": either only a, or both a 
and b. Thus their behaviors are not equivalent. 
Alternatively, compare the expressions (a+a.b)* and 
(a/b)*: the former allows a sequence of only a’s, 
whereas the set of sequences allowed by the latter is 
exactly the set of prefixes of ababab... 


Semantics of path expressions define only the 
consistency requirement, and therefore may be said to 
use the expedient approach [1, 6, 7, 12]. The commit 
construct "/" is equivalent to the ";" of path 
expressions, but the effect of our sequence construct 
"." is not achievable in path expressions. However, by 
adding the device of "predicates" [1], it appears that 
the effect could be achieved. This device could also be 
used to overcome some limitations of expression-based 
control described in [5], viz. the inability to specify 
constraints based on the state of the resource, 
parameters of tokens, etc. The proper integration of 


such devices into expression-based control for 
applicative languages is still a subject of our 
investigation. 


4. FORMAL SEMANTICS 


To provide a formal semantics which reflects 
completeness as well as consistency, we must take into 
account the collection of input tokens available to the 
resource expression. Since we wish to define the 
behavior of repetitive expressions inductively, we must 
not only define the allowable order of tokens, but also 
the collection of tokens remaining after each 
repetition. 


We therefore define the behavior of a resource 
expression for any bag of input tokens T as a set of 
pairs of the form <g, r>, where g is an execution graph 
and r is a bag of residues. These pairs will henceforth 
be referred to as g-r pairs. We use a bag, rather than a 
set, since we need to represent inputs which have 
several tokens belonging to the same token class. 


An execution graph is a generalization of an 
execution sequence, and is defined by the functions 
SEQ and PAR, which have the following meanings: 


SEQ(x,y) : x is executed before y 
PAR(x,y) : x is executed concurrently with y 


where x and y represent either tokens or execution 
graphs composed of SEQ and PAR. Both x and y must 
have completed their execution in order for PAR(x,y) 
or SEQ(x,y) to complete their execution. For PAR(x,y), 
x and y need not have started execution at the same 
time. 


The residue r is the bag of tokens T minus the 
tokens used in defining the execution graph g. 


To simplify the definition of its semantics, a 
resource expression is first converted into an 
equivalent normal form. We then define the semantics 
of normalized resource expressions inductively by 
showing how the set of g-r pairs can be constructed, 
and give some examples illustrating our construction. 


The normal form is a set of alternative prefixed-sequences, 
where prefixed-sequences are of the form 
x_1/.../x_m where "/" is the commit construct, and 
alternatives are specified by "+". Each x_i, except for 
x_1, is a set of alternative sequences, where sequences 
are of the form y_1. ... .y_n. However, x_1 is a sequence of 
the form y_1. ... .y_n (with no alternatives). Finally, each 
y_i is either an atom or a repetitive expression whose 
body has been expressed in normal form. Examples of 
the normal form are shown below: 

a.b.a 

a.b/(a + b/a) 

a*.b.{c} + a/(a + b) 
{a/b.c + (a+b)*.b} 

Examples of expressions not in the normal form 
are the following: 

(a+b).(c+d) 
(a+b)/c 
((a+b)*)* 

The normal form is derived by transforming a 
given resource expression using two sets of equalities. 
The first set is the following: 

P.(Q+R)=(P.Q)+(P.R) 

(P+Q).R=(P.R)+(Q.R) 

(P+ Q)/R=(P/R) + (Q/R) 
where P, Q, and R are assumed to be arbitrary 
expressions. A notable exception, however, is that P / 
(Q + R) is not equal to P /Q +P / R. Consider, for 
example, the meanings of the two expressions a/(b+c) 
and a/b + a/c: the former expression specifies that the 
selection of b or c is to be made after a token for a is 
evaluated, whereas the latter implicitly selects b or c 
even before a is evaluated. Thus their operational 
meanings differ. 
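
The first set of equalities can be read as rewrite rules applied from left to right. The Python sketch below is a minimal illustration, assuming a hypothetical tuple encoding of expressions: atoms are strings, and ("seq", P, Q), ("commit", P, Q), and ("alt", P, Q) stand for P.Q, P/Q, and P+Q. It deliberately leaves P/(Q+R) untouched, as required by the exception noted above.

```python
def distribute(expr):
    """Push "+" outward using the three distributive equalities."""
    if isinstance(expr, str):
        return expr
    op, p, q = expr
    p, q = distribute(p), distribute(q)
    # (P+Q).R = (P.R)+(Q.R)  and  (P+Q)/R = (P/R)+(Q/R)
    if op in ("seq", "commit") and isinstance(p, tuple) and p[0] == "alt":
        return ("alt",
                distribute((op, p[1], q)),
                distribute((op, p[2], q)))
    # P.(Q+R) = (P.Q)+(P.R); there is no such rule for "/".
    if op == "seq" and isinstance(q, tuple) and q[0] == "alt":
        return ("alt",
                distribute(("seq", p, q[1])),
                distribute(("seq", p, q[2])))
    return (op, p, q)


# (a+b)/c becomes a/c + b/c, whereas a/(b+c) is left as it is.
print(distribute(("commit", ("alt", "a", "b"), "c")))
print(distribute(("commit", "a", ("alt", "b", "c"))))
```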

The second set of equalities relates the repetitive 
constructs. A partial list is the following: 

((R)*)* = (R)* 
((R)*)# = (R)# 
[[R]] = [R] 
[RD = 


In converting a resource expression into the normal 
form, a subexpression satisfying a form given on the 
LHS of the above equalities is replaced by the 
corresponding RHS. 


We define the semantics of a normalized resource 
expression N for a bag of input tokens T by 
constructing a set L(N,T) inductively. We illustrate the 
construction for alternatives, prefixed-sequences, and 
the repetitive expressions only; a complete treatment 
may be found in [15]. 


1. For alternatives, the set of g-r pairs is the union of 
the set for each term. The union is taken to reflect 
the non-determinism of "+". 


2. For prefixed sequences, the g-r pairs for the 
sequence up to the first commit "/" are first 
constructed. For each residue r in the above set, 
the g-r pairs for the subexpression up to the 
second commit are obtained, etc. The resulting 
execution graph is obtained by sequencing (using 
SEQ) the execution graphs of each term: the 
resulting residue is that of the last term. 


3. Finally, for repetitive expressions, there are 
basically two cases: a) for "*" and "[]", the g-r 
pairs will also include the pair <e,T>, where e is 
the null graph and T is the input bag of tokens, 
whereas for "#" and "{}" this pair will not be 
included. b) For "*" and "#", sequences are 
constructed using SEQ, whereas for "[]" and "{}" 
execution graphs are constructed using PAR. In 
all cases, the set of g-r pairs is constructed 
inductively: the residue from the first repetition 
being used as the bag of input tokens for the 
second, etc. 


We express the semantics of the above three types 
of expressions more formally as follows: 


1. Consider $N = x_1 + x_2 + \cdots + x_n$, where the $x_i$ are prefixed 
sequences. We define 
$$L(N,T) = L(x_1,T) \cup L(x_2,T) \cup \cdots \cup L(x_n,T)$$ 
to be the set of g-r pairs for N, assuming $L(x_i,T)$ is 
the set of g-r pairs for each $x_i$. 


2. Consider $N = x_1 / x_2 / \cdots / x_n$, where $x_1$ is an 
ordinary sequence, and all other $x_i$ are sets of 
alternative sequences. Then we define L(N,T) 
inductively as follows: Let 
$$L(x_1 / \cdots / x_{n-1},\, T) = \{\langle g_i, r_i\rangle \mid i = 1,\ldots,k\},$$ 
and for $i = 1,\ldots,k$, 
$$L(x_n, r_i) = \{\langle g_{ij}, r_{ij}\rangle \mid j = 1,\ldots,l_i\}.$$ 
Then we define 
$$L(N,T) = \bigcup_{i=1}^{k} \bigcup_{j=1}^{l_i} \{\langle \mathrm{SEQ}(g_i, g_{ij}),\, r_{ij}\rangle\}.$$ 


3. Consider N = [x] where x is any normalized 
resource expression. Let 
$$L(x,T) = \{\langle g_i, r_i\rangle \mid i = 1,\ldots,k\},$$ 
and for $i = 1,\ldots,k$, let 
$$L(N, r_i) = \{\langle g_{ij}, r_{ij}\rangle \mid j = 1,\ldots,m_i\}.$$ 
Then we define 
$$L(N,T) = \bigcup_{i=1}^{k} \bigcup_{j=1}^{m_i} \{\langle \mathrm{PAR}(g_i, g_{ij}),\, r_{ij}\rangle\} \cup \{\langle e, T\rangle\},$$ 
where e represents the null graph. 


We illustrate the set of g-r pairs for some simple 
resource expressions. We use bag(...) to denote a bag of 
tokens; bag() is the empty bag. 


(1) N = a/(b + c) 
T = bag(a,b) 
L(N,T) = {<SEQ(a,b), bag()>} 


(2) N = a/b + a/c 
T = bag(a,b) 
L(N,T) = {<SEQ(a,b), bag()>, 
<a, bag(b)>} 


(3) N = c.[a + b].a 
T = bag(a,a,b,c) 
L(N,T) = {<SEQ(c,a), bag(a,b)>, 
<SEQ(c,a,a), bag(b)>, 
<SEQ(c,b,a), bag(a)>, 
<SEQ(c,PAR(b,a),a), bag()>} 
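
Examples (1) and (2) can be reproduced with a small Python sketch restricted to alternatives of prefixed sequences over atoms (no repetitive expressions). The encoding and function names are assumptions for illustration only: an expression is a list of prefixed sequences, a prefixed sequence is a list of stages separated by "/", and a stage is a list of alternative atom sequences. Execution graphs are flattened to tuples, so ("a", "b") stands for SEQ(a,b). Following example (2), the sketch also assumes that a committed prefix whose continuation cannot be satisfied is kept together with its residue.

```python
from collections import Counter


def sequence_pairs(sequence, bag):
    """Prudent rule: a plain sequence needs all of its tokens at once."""
    need = Counter(sequence)
    if any(bag[tok] < n for tok, n in need.items()):
        return []
    return [(tuple(sequence), bag - need)]


def stage_pairs(stage, bag):
    """A stage is a set of alternative sequences; take the union."""
    pairs = []
    for sequence in stage:
        pairs.extend(sequence_pairs(sequence, bag))
    return pairs


def prefixed_pairs(stages, bag):
    """Thread the residue of each committed stage into the next stage."""
    pairs = stage_pairs(stages[0], bag)
    for stage in stages[1:]:
        extended = []
        for graph, residue in pairs:
            continuations = stage_pairs(stage, residue)
            if continuations:
                extended.extend((graph + g, r) for g, r in continuations)
            else:
                # Assumption: the committed prefix stands on its own.
                extended.append((graph, residue))
        pairs = extended
    return pairs


def gr_pairs(alternatives, tokens):
    """Set of (graph, residue) pairs for a "+" of prefixed sequences."""
    bag = Counter(tokens)
    result = set()
    for stages in alternatives:
        for graph, residue in prefixed_pairs(stages, bag):
            result.add((graph, tuple(sorted(residue.elements()))))
    return result


example_1 = [[[["a"]], [["b"], ["c"]]]]            # a/(b + c)
example_2 = [[[["a"]], [["b"]]], [[["a"]], [["c"]]]]  # a/b + a/c
print(gr_pairs(example_1, "ab"))
print(gr_pairs(example_2, "ab"))
```

With T = bag(a,b), the first call yields the single pair for SEQ(a,b) with an empty residue, while the second also yields the pair for a alone with residue bag(b).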


5. IMPLEMENTATION OF RESOURCE EXPRESSIONS 


There are two important steps in the 
implementation of resource expressions: 


1. The expressions are represented in an 
intermediate form which consists of a set of 
condition-action pairs, similar to guarded 
commands [9]. (However, we are not relying on an 
existing implementation of guarded commands in 
our implementation.) 


2. The target language program for a given 
intermediate form is constructed in a modular 
form by translating conditions and actions 
separately, and then combining the resulting 
programs together. Each repetitive expression is 
translated as a single recursive procedure, and 
the top-level expression is translated as a single 
procedure, if it is not a repetitive expression. 


The next three sections describe the intermediate 
form, the target language primitives, and the 
translation respectively. 


5.1. INTERMEDIATE FORM 


The intermediate form is derived using the 
semantics of the different types of normalized 
resource expressions, viz. sequences,  prefixed- 
sequences, alternatives, and repetitive expressions. In 
each case we determine a condition that must be 
satisfied before the corresponding action is taken. This 
condition is a conjunction of numeric thresholds for 
each token class, and indicates the minimum number 
of tokens of each class that must be present in order 
to take the corresponding action. The action specifies 
the actual sequence of tokens to be served. We first 
briefly explain how the condition-action pairs for a 
normalized resource expression are derived. 


For a sequence, the condition is determined by 
considering only its atomic terms, i.e. excluding all 
repetitive expressions in the sequence. Repetitive 
expressions do not participate in the construction of 
the threshold condition because they permit zero 
repetitions of the body to occur, and therefore have a 
trivially satisfiable threshold condition. However, when 
a repetitive expression is encountered during the 
action, the condition corresponding to the body of the 
repetitive expression will be tested to see if any further 
repetitions are possible. Thus the actual number of 
repetitions that occur depends on the number of 
available tokens at this time. 


For a prefixed-sequence, the condition is that 
determined by the (ordinary) sequence up to the first 
commit construct in the prefixed-sequence. The 
condition corresponding to the remainder of the 
prefixed-sequence is tested only after the action 
corresponding to the initial prefix has been taken. It 
should be noted that (COMMIT ...) may occur only as 
the last term in a sequence of length > 1. 


The intermediate form for a set of alternatives 
such as w_1+w_2+...+w_n is ((c_1 a_1) (c_2 a_2) ... (c_n a_n)), 
where (c_i a_i) is the intermediate form of w_i. 

The intermediate form of a repetitive expression r is 
of the form 

(REPEAT ...) where 
REPEAT = If r = (x)* then STAR else 
If r = (x)# then POUND else 
If r = [x] then BRACKET else 
If r = {x} then BRACE 
and the dots represent the intermediate form of x. 


For example, the intermediate form of {a.b.a/b + 
a.[a+c].c} may be derived from the above rules to yield 


(BRACE ((c1 a1) (c2 a2))) where 
c1 = ((2 a) (1 b)) 
a1 = (a b a (COMMIT (((1 b)) (b)))) 
c2 = ((1 a) (1 c)) 
a2 = (a (BRACKET (((1 a)) (a)) (((1 c)) (c))) c) 
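
The derivation of a threshold condition from an action can be sketched in Python. This is an illustration under assumptions: an action is encoded as a list whose atomic terms are strings, while nested terms such as (COMMIT ...) or (BRACKET ...) are tuples whose contents are elided here; the function name is hypothetical.

```python
from collections import Counter


def threshold_condition(action):
    """Count only atomic terms; nested COMMIT and repetitive terms are skipped.

    Returns (count, token_class) pairs in order of first occurrence,
    mirroring the ((2 a) (1 b)) notation of the intermediate form.
    """
    counts = Counter(term for term in action if isinstance(term, str))
    return [(n, token_class) for token_class, n in counts.items()]


# Actions a1 and a2 of the example, with nested terms reduced to placeholders.
a1 = ["a", "b", "a", ("COMMIT", "...")]
a2 = ["a", ("BRACKET", "..."), "c"]
print(threshold_condition(a1))
print(threshold_condition(a2))
```

Under this encoding the two results correspond to the conditions c1 = ((2 a) (1 b)) and c2 = ((1 a) (1 c)) given above.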


5.2. PRIMITIVES FOR SYNCHRONIZATION 


We now turn to a brief review of the primitive 
queueing operators for synchronization described in 
[14]. The primitive operator queue() creates an empty 
queue initially. The contents of the queue may be 
modified, by a side-effect, via the operators enq and 
deq. enq(q,f) synchronizes the execution of a 
functional expression f by enqueueing a token for f in 
the queue q; the actual execution of f can be initiated 
only after the resource dequeues the token for f from 
the queue q, using deq(q). The value of enq(q,f) is the 
value computed by f; the value of deq(q) is delay(f), 
where the token for f is at the head of q. delay(f) is the 
unevaluated form of f, and the evaluation may be 
explicitly forced by force(d) where d = delay(f). If we 
wish to evaluate the token immediately after 
dequeueing, we may use evalq(q), which is equivalent to 
force(deq(q)). Separating the dequeuing of a request 
from its evaluation facilitates the execution of several 
tokens from a single queue in parallel. 


When multiple queues are used to synchronize 
several different classes of tokens, it is often necessary 
to test for the presence of tokens in the different 
queues. The operator waitq(q,n) tests and waits until 
q has at least n tokens in it, and only then returns a 
value, say true, as its result. In contrast to waitq, the 
operator nonempty(q,n) returns true if q has at least n 
tokens in it, and false otherwise; thus no waiting is 
involved. 


The last queueing primitive to be used here is 
reserveq(q,n) which reserves the first n tokens of q and 
makes them "invisible" during any subsequent testing 
of q -- either by waitq or nonempty. The motivation for 
this operator will be clear when the translation of 
resource expressions is considered. 


We summarize all the queueing operators below: 


queue()         creates an empty queue. 

enq(q,f)        synchronizes the evaluation of f using q 
                by enqueueing a token for f in q. 

deq(q)          returns an unevaluated form, delay(f), 
                where the token for f was at the head 
                of q. 

force(d)        evaluates f, where d = delay(f). 

evalq(q)        dequeues and evaluates f, where the 
                token for f is at the head of q. 

waitq(q,n)      tests and waits until q has at least 
                n tokens. 

nonempty(q,n)   returns a boolean value indicating 
                whether or not q has at least n tokens. 

reserveq(q,n)   reserves the first n tokens of q. 
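
To make the interplay of these operators concrete, here is a single-threaded Python sketch. It is a simplified model built on assumptions, not the primitives of [14]: tokens are zero-argument callables, enq merely stores the token instead of returning the value computed by f, waitq is omitted because nothing blocks in this model, and the class name TokenQueue is hypothetical.

```python
from collections import deque


class TokenQueue:
    """A FIFO of delayed computations plus a count of reserved tokens."""

    def __init__(self):
        self.tokens = deque()
        self.reserved = 0


def queue():
    return TokenQueue()


def enq(q, thunk):
    # Simplification: store the token; the caller does not wait for a value.
    q.tokens.append(thunk)


def deq(q):
    # Dequeueing the head also consumes one reservation, if any.
    if q.reserved:
        q.reserved -= 1
    return q.tokens.popleft()


def force(delayed):
    return delayed()


def evalq(q):
    return force(deq(q))


def nonempty(q, n):
    # Reserved tokens are invisible to the test.
    return len(q.tokens) - q.reserved >= n


def reserveq(q, n):
    q.reserved += n


q = queue()
enq(q, lambda: 1 + 1)
enq(q, lambda: 2 * 3)
print(nonempty(q, 2))
```

After reserveq(q, 1), the test nonempty(q, 2) would fail even though two tokens are still physically queued, which is the behavior needed when a selected alternative has claimed tokens that later threshold tests must not count again.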


In order to arbitrate among several queues and 
exercise control over the order in which tokens from 
different queues are selected and evaluated, we 
introduce the following operators: 

seq(a_1,...,a_n)    evaluates the expressions a_1,...,a_n in 
                    sequence; the result returned is a_n. 

spar(a_1,...,a_n)   evaluates the expressions a_1,...,a_n in 
                    parallel; the result is a_n, but is 
                    returned after all a_1,...,a_n have 
                    been evaluated. 

arbit(a_1,a_2)      evaluates a_1 and a_2 in parallel; the result 
                    is false if a_2 is evaluated before a_1, 
                    otherwise true. 
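
The behavior of arbit can be approximated in Python with two worker threads. This is a sketch under assumptions, not the FGL primitive: the two arguments are zero-argument callables rather than unevaluated expressions, and the sketch waits for both to finish before returning.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def arbit(a1, a2):
    """Run a1 and a2 in parallel; False only if a2 finishes strictly first."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(a1)
        second = pool.submit(a2)
        done, _ = wait([first, second], return_when=FIRST_COMPLETED)
        # If a1 is among the computations finished so far, report True.
        return first in done


def fast():
    return True


def slow():
    time.sleep(0.2)
    return True


print(arbit(fast, slow))
print(arbit(slow, fast))
```

In the translation, the arguments of arbit are threshold tests built from waitq, so the one that "finishes first" is the condition that becomes true earliest.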


5.3. TRANSLATION 


The basic approach to the translation is to allocate 
one queue for each distinct token class. Given an 
intermediate form ((c_1 a_1) (c_2 a_2) ... (c_n a_n)), we test the 
conditions c_1,...,c_n in parallel and select the one that is 
detected to be true earliest. This parallel testing and 
selection is accomplished by means of a chain of 
arbit’s as follows: 

LET t_1 = arbit(c_1, t_2) 
    t_2 = arbit(c_2, t_3) 
    ... 
    t_{n-1} = arbit(c_{n-1}, c_n) 
RESULT 
    IF t_1 THEN a_1 ELSE 
    IF t_2 THEN a_2 ELSE 
    ... 
    IF t_{n-1} THEN a_{n-1} ELSE a_n 

where c_i and a_i are to be replaced by their translated 
programs respectively. Note that c_i is of the form 
((n_1 op_1) (n_2 op_2) ... (n_k op_k)). Hence if we allocate q_1 to 
token class op_1, q_2 to op_2, ..., q_k to op_k, we may 
translate c_i as spar(waitq(q_1,n_1), waitq(q_2,n_2), ..., 
waitq(q_k,n_k)), which tests and waits until the threshold 
condition c_i becomes true. Note: The abbreviations 
t_1,...,t_{n-1} are treated as common subexpressions in 
FGL, and hence are evaluated only once. Also, the order 
in which the abbreviations are defined is immaterial. 


The general form of an action a_i is (x_1 x_2 ... x_j) 
where each x_k can only be an atomic term or a 
repetitive expression; however, the last term, x_j, can 
also be a prefixed sequence. For sake of uniformity we 
will assume that the result produced by a_i is in the 
unevaluated form and must be forced explicitly, 
similar to that for any atomic term. Thus we have the 
following general form for the translated program for 
a_i: 

LET d_1 = trans(x_1) 
    d_2 = trans(x_2) 
    ... 
    d_j = trans(x_j) 
RESULT 
    seq(d_1,...,d_j, 
        delay(seq(force(d_1),...,force(d_j)))) 

where trans(x) = 
If atom(x) then deq(queue_for_x) else 
If x = (COMMIT ...) then commit(queues_for_x) else 
If x = (STAR ...) then star(queues_for_x) else 
If x = (BRACE ...) then brace(queues_for_x) else 
etc. 
where commit, star, and brace are procedures for 
(COMMIT ...), (STAR ...), and (BRACE ...) respectively. 


The difference between # and * (and also between 
{} and []) from the standpoint of their implementation 
is that the recursion in the former case has no 
termination condition, whereas for the latter the 
recursion terminates when none of the threshold 
conditions of its body is satisfied by the available 
tokens. Thus the recursion expands, in the former 
case, only as much as there are tokens in the input to 
satisfy some threshold condition of the body. 
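
The terminating case can be pictured with a short Python sketch. It is an illustration under assumptions: the body has a single threshold condition given as a mapping from token class to count, repetitions are taken greedily while that condition holds, and the function name is hypothetical. The non-terminating constructs have no such exit test and are not modeled here.

```python
from collections import Counter


def star_repetitions(threshold, tokens):
    """Repeat the body while its threshold condition is still satisfied.

    Returns the number of repetitions and the bag of leftover tokens.
    """
    need = +Counter(threshold)  # unary plus drops zero or negative counts
    if not need:
        raise ValueError("the body must require at least one token")
    bag = Counter(tokens)
    count = 0
    while all(bag[tok] >= n for tok, n in need.items()):
        bag -= need
        count += 1
    return count, bag


# Body a.b with tokens a, a, b, b, b: two repetitions, one b left over.
count, leftover = star_repetitions({"a": 1, "b": 1}, "aabbb")
print(count, dict(leftover))
```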


When an expression occurs as the last term in an 
action, say x_j, the evaluation of x_j must take place after 
x_{j-1}, but the threshold condition for x_j may be tested 
concurrently with evaluation of x_1,x_2,...,x_{j-1}. Assuming 
that the translated program for x_j is represented by 
commit(queues_for_x_j), we may express the translated 
program for (x_1 x_2 ... x_{j-1} (COMMIT ...)) by modifying 
the LET and RESULT expression above as follows: 

LET com = commit(queues_for_x_j) 
RESULT seq(d_1,...,d_{j-1}, 
           spar(seq(force(d_1),...,force(d_{j-1}), 
                    force(com)), 
                com)) 


The difference between the translation of * and [] 
is that the evaluation of successive repetitions will be 
sequential for * and concurrent for []. In both cases, 
however, the testing of the threshold condition of the 
body and the construction of the unevaluated form will 


be similar, i.e. the threshold condition of successive 
repetitions of a [] will be tested sequentially. Once a 
threshold condition among the set of alternatives is 
selected, it is necessary to reserve as many tokens as 
indicated in the threshold condition. Such 
reservations ensure that these tokens are not re-used 
during the testing of the threshold condition inside the 
body of the repetitive expression. The number of 
repetitions in both cases will depend on the number of 
available tokens, but is in general indeterminate. 
Furthermore, the actual set of tokens used in 
constructing the unevaluated form is dequeued before 
any evaluation is initiated. 


We illustrate some of the important steps of the 
translation using the expression {a.b.a/b + a.[a+c].c}. 
The intermediate form is 


(BRACE 
  ((((2 a) (1 b)) 
    (a b a (COMMIT (((1 b)) (b))))) 
   (((1 a) (1 c)) 
    (a (BRACKET (((1 a)) (a)) (((1 c)) (c))) c)))) 


The translated program is 


PROCEDURE brace(qa,qb,qc) 
LET t1 = arbit(spar(waitq(qa,2), waitq(qb,1)), 
               spar(waitq(qa,1), waitq(qc,1))) 
    d1 = deq(qa)   d2 = deq(qb) 
    d3 = deq(qa)   d4 = deq(qc) 
    com = deq(qb) 
    brc = brace(qa,qb,qc) 
    brck = bracket(qa,qc) 
RESULT seq( IF t1 THEN seq(d1,d2,d3, 
                 delay(spar(seq(force(d1), force(d2), 
                                force(d3), force(com)), 
                            com))) 
            ELSE seq(reserveq(qa,1), 
                     reserveq(qc,1), 
                     seq(d1,brck,d4, 
                         delay(seq(force(d1),force(brck), 
                                   force(d4))))), 
            brc) 
WHERE 
PROCEDURE bracket(qa,qc) 
LET t1 = or(n1,n2) 
    d1 = deq(qa)   n1 = nonempty(qa,1) 
    d2 = deq(qc)   n2 = nonempty(qc,1) 
    brck = bracket(qa,qc) 
RESULT IF t1 THEN 
          IF n1 THEN seq(d1,brck, 
                         delay(seq(force(d1), 
                                   force(brck)))) 
          ELSE seq(d2,brck, 
                   delay(seq(force(d2), 
                             force(brck)))) 
       ELSE nil 


In order to initiate the evaluation of the entire 
program, it is necessary to force the top-most 
expression by force(brace(qa,qb,qc)). 


6. SUMMARY AND CONCLUSIONS 


Resource expressions are proposed here as a 
high-level linguistic means of specifying resource 
control. These expressions are composed of primitive 
constructs, for arbitration, iteration, etc., and are 
capable of specifying solutions to a variety of 
problems. Given that it may be necessary to 
coordinate concurrent computations that arise in 
functional programs and interface functional 
programs with structures that have a shared state, 
e.g. databases, we feel an expression-based language 
like resource expressions is appropriate for this 
purpose, since they are notationally compatible with 
the applicative style, and also a simple denotational 
definition for them can be constructed. 


Resource expressions are very similar to path 
expressions in their basic approach to specification, 
but differ in their semantics and implementation. We 
have formalized the semantics of resource expressions 
in terms of a set of execution graph-residue pairs, by 
defining this set for any bag of input tokens. The main 
difference in our semantics is that we take into 
account the notions of consistent as well as complete 
behavior of an expression. In order to provide the user 
the capability of choosing from different shades of the 
two approaches, the "/" construct has been included. 


We have presented a systematic translation of 
resource expressions in terms of the queueing 
primitives of a demand-driven execution model. In 
comparison to implementations of path expressions, 
our approach does not require restrictions, such as 
those barring repeated occurrences of an operation 
name, etc. [1, 6, 7]. The two main steps in our 
translation are the following: conversion to an 
intermediate form (condition-action pairs) based on 
the semantics of the different constructs, and 
translation of the intermediate form in terms of the 
queueing primitives. The latter translation is in turn 
separated into translating conditions and actions, and 
then combining the two translated program fragments 
together. Owing to the modularity in the translation 
process and its close bearing to the defined semantics, 
we have been able to construct a correctness proof of 
the translation for an abstract implementation [15]. 


An interesting application of demand-driven 
evaluation in this implementation is in representing 
infinite execution graphs: e.g. we translate the 
expression (a+b)# using what appears to be a 
nonterminating recursion; however, because of 
demand-driven evaluation, the recursion will expand 
out as much as is necessary to accommodate available 
tokens. Other benefits of demand-driven evaluation for 
resource control are discussed in [14], e.g. in 
rendering simple solutions to the problem of busy- 
waiting, etc. 


It is possible to perform several optimizations on
the translated program, in order to reduce its space
and time requirements, by techniques such as: 1)
combining deq and force into evalq, 2) minimizing the
number of queues, 3) avoiding unnecessary
reservations, etc. A fuller discussion of these
optimizations and the conditions under which they are
applicable is presented in [15].


PARALLEL IMPLEMENTATION OF FUNCTIONAL LANGUAGES


Abstract


Functional programming, and its 
implementation using parallel architectures, is 
receiving increasing attention in the literature. 
Turner [4] has proposed a novel implementation 
for sequential machines using a variable-free 
form of code based on logical combinators. 


We present one translation of combinatory 
representations to process nets which allows full 
exploitation of parallelism. Our notation (LNET) 
is an exchange-view, behaviour passing variant of 
Milner's CCS. 


Introduction 


Programming even a sequential machine in a 
provably correct and maintainable fashion 
presents a complex and challenging intellectual 
task. Managing this complexity in the face of 
parallel architectures presents an immense 
challenge, and approaches which reduce this 
complexity are currently receiving widespread 
attention in the literature. 


Functional languages reduce the complexity 
of the programming task by prohibiting 
destructive assignment: a functional program may 
be viewed as a set of mathematical equations 
which specify the solution. This is good from the 
software engineering viewpoint, but makes life 
hard for the implementer who now has to work out 
when it is safe to forget values, resorting to 
garbage collection in extremis. 


On the other hand, because there are no 
side-effects in a functional language, expressions 
may always be evaluated in parallel, which 
suggests we may remedy at least some of the 
perceived inefficiency of functional programming 
by buying speed from parallel technology. 


We present here a language, LNET, for 
describing parallel processes, and show how 
functional languages can be translated via 
combinators into LNET. 


We rely heavily on the reader's 
willingness to read [2], [3], and [4]. The 
recent ACM conference [1] contains much useful 
background. 


Major characteristics of LNET 


LNET stands for Language of Named 
Experiment Trees, reflecting its origins in CCS 
[3], with which we assume familiarity. There are 
two major changes with respect to CCS: 


1. CCS ports, which present some difficulties
of implementation, are eliminated in favour of
process names, and the underlying message-passing
medium is no longer assumed to be synchronous.
Instead, each process has a name, generated at
run-time when the process was created, by which
it is known to other processes. A communication
between two processes takes the form of an
exchange of messages, in which one process (the
active partner in the exchange) directs a message
to the other process (the passive partner), and
then waits for that process to accept the message
and send back a reply. The passive process may
perform some local processing to compute the
reply, but no intervening communications are
allowed. The result is that a single exchange
can be implemented as a pair of asynchronous
communications, while being logically equivalent
to one indivisible synchronous event. This allows
LNET to be given a clean axiomatic semantics.
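

The exchange described in point 1 can be illustrated with a small sketch. It is not LNET and not the authors' implementation: it merely assumes that each process owns a mailbox queue, and the names Process, exchange and accept are hypothetical. The active partner sends a message together with a private reply queue and then blocks; the passive partner computes the reply locally and sends it back, so the two asynchronous communications behave like one indivisible event.

```python
import queue
import threading

class Process:
    """A process known to others only by its name (here: the object itself)."""
    def __init__(self):
        self.mailbox = queue.Queue()

def exchange(target, message):
    """Active partner: send `message` to `target`, then wait for the reply."""
    reply_box = queue.Queue(maxsize=1)
    target.mailbox.put((message, reply_box))   # first asynchronous communication
    return reply_box.get()                     # second one, carrying the reply

def accept(proc, compute_reply):
    """Passive partner: take one message, compute the reply locally, send it."""
    message, reply_box = proc.mailbox.get()
    reply_box.put(compute_reply(message))
    return message

passive = Process()
worker = threading.Thread(target=accept, args=(passive, lambda m: m + 1))
worker.start()
reply = exchange(passive, 41)
worker.join()
print(reply)
```

Because the passive side performs no other communication between accepting the message and replying, an observer cannot distinguish the pair of queue operations from a single synchronous rendezvous.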


2. Process behaviours and process names can be 
sent as messages from one process to another. 
This extension is necessary to allow the dynamic 
rearrangement of patterns of communication which 
our distributed implementation of graph reduction 
requires. 


Basic LNET constructs


Space precludes a full syntax and semantics
for LNET. Here we shall informally describe the
principal constructs of the language.


An LNET process is of the form X:p. p is a
process behaviour, which specifies the
communications the process is capable of. X is a
process identifier, which at run time will be
bound to an automatically generated process name 
unique to the process. (Note that process names 
themselves do not appear in the syntax of LNET). 
The behaviour p may take any of the following 
forms: 
1. NIL - the behaviour which does nothing.

2. par X1:p1 | ... | Xk:pk in p'

This creates k new processes whose behaviours are
p1,...,pk and whose run-time names are bound to
the process identifiers X1,...,Xk. They run
concurrently with the process X:p'.


3. g.p' g is a guard, which is constructed 
from serial or parallel combinations of 
communications. In the notation of context-free 
grammars we have: 


g ::= c / g.g / g|g


The communication c takes one of three forms. 


(i) e!X?x This is an active communication,
directed at the process identified by X. The
expression e is evaluated and sent to X; the
communication then awaits the reply and binds it 
to x. The remainder of the behaviour in which 
this communication appears may use the value of x. 


(ii) x?!e This is a passive communication. It 
accepts from any other process a message which it 
binds to the variable x. It then evaluates the 
expression e (which may depend on x) and transmits 
the result back to the process making the active 
communication. The value of x is, as with the 
previous form of communication, available to the 
remainder of the behaviour. 


We extend this by also allowing passive 
communications to take the forms t?!e and 
t(x)?!e, where t is any of some countable but 
otherwise unspecified set of tokens. The 
communication t?!e will wait for some process to 
make it an active communication of the form 
t!X?x' (with the same token t). Similarly the 
communication t(x)?!e will only accept a 
corresponding active communication of the form 
t(e')!X?x'.


(iii) wait This is also a passive communication. 
It suspends the process in which it occurs until 
some other process makes an active communication 
to it, and then proceeds without replying. The 
process making the active communication is still 
waiting for a reply; a later passive 
communication of the form (ii) will succeed and 
provide the reply. 


4. ind(X') This is an indirection behaviour.
It accepts any active communication made to it
and retransmits it to the process identified by
X'. X', if and when it accepts the communication,
will send its reply directly to the process that 
made it, rather than via the indirection process. 


5. LNET has a fairly conventional apparatus of 
let and where declarations, and conditionals. 


Translating functional programs into LNET 


We assume familiarity with Turner's paper 
[4] in which he shows how functional programming 
languages can be implemented by translating their 
programs into a variable-free form, by 
introducing a few constant functions called 
combinators. Apart from the usual basic values 
and operators, just two combinators, called K and 
S, are sufficient to express any functional 
program: 

(K x) y = x        ((S x) y) z = (x z)(y z)
These definitions of K and S can be read as 
rewrite rules, allowing any instance of their 
left hand sides (a redex) to be rewritten as the 
right hand side. After translation into 
combinators, a program can be executed by 
repeatedly applying these rules. A translation 
into K and S alone is highly inefficient; 
however, by introducing a few more combinators - 
six, in fact - a more efficient translation can
be obtained.
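

As a small illustration of the two rewrite rules, K and S can be written as curried functions. This is only a sketch in Python, not Turner's implementation, and the names I, succ and program are introduced here for the example. Python evaluates arguments eagerly, so (y z) is computed even when x discards it, which a normal-order reducer would avoid.

```python
# Curried encodings of the two rewrite rules.
K = lambda x: lambda y: x                      # (K x) y = x
S = lambda x: lambda y: lambda z: x(z)(y(z))   # ((S x) y) z = (x z)(y z)

# S K K behaves as the identity: S K K a -> (K a)(K a) -> a
I = S(K)(K)

# A variable-free program: S (K succ) I n -> (K succ n)(I n) -> succ n
succ = lambda n: n + 1
program = S(K(succ))(I)

print(I(17), program(41))
```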


Expressions built up from combinators, basic 
values, and operators can be represented as trees, 
or, more generally, as directed graphs, which 
allow sharing of common subexpressions. We shall 
now show how these combinator graphs can be 
translated into LNET process nets. The basic 
idea is that each node of the graph is modelled 
by a process, and processes representing adjacent 
nodes of the graph communicate with each other in 
such a way that the resulting behaviour of the 
process net corresponds with the operation of 
graph reduction. Various regimes of graph 
reduction (normal order, parallel innermost, etc.) 
and combinations of these can be modelled by 
choosing the translation appropriately. 


The reduction rule for S, in graphical
notation, is as follows. The nodes marked "@"
(read "apply") represent function applications.


[Diagram: the graph of apply nodes for ((S x) y) z is rewritten to the graph for (x z)(y z).]


The circled part of the left-hand graph is the
redex reduced by this rule. If the nodes of the
graph are processes distributed in some way over 
an underlying network of processors, a significant 
amount of non-local computation may be required 
just to establish the existence of a redex. Our 
first step is therefore to break down the 
reduction rules into smaller steps, each of which 
requires communication only between one process 
and its immediate neighbours. We subdivide S 
into three different forms, S0, S1 and S2, with
the rules:

[Diagrams: the local rewrite rules for S0, S1 and S2; the drawings are not recoverable from the source.]

We do the same thing for basic operators such as
+:

[Diagrams: the corresponding stepwise rules for +, ending in a +2 node that is rewritten to a node holding the value of a+b (when a and b are simple integers).]


List-handling operators (cons, nil, head, tail) 
can be handled similarly. 


We now define LNET behaviours to model 
these. 


S0 = let p = app?!S1 . p in p

S1 = λX. let p = app?!(S2 X) . p in p

S2 = λX.λY. let p = app?!(S3 X Y) . p in p

S3 = λX.λY.λZ. par V: (APPLY X Z)
                 | W: (APPLY Y Z)
               in (APPLY V W)

Most other combinators may be modelled in the
same way. Two, the identity I and the "deleter"
K, require the use of indirection behaviours.
Their combinator reduction rules are:

I x -> x        (K x) y -> x

Accordingly, we define the "incomplete"
behaviours I0, K0 and K1 analogously to S0 and S1,
and define I1 and K2 as:

I1 = λX. ind(X)        K2 = λX.λY. ind(X)

APPLY is the behaviour which models apply nodes:

APPLY = λX.λY. app!X?z . (z Y)
This behaviour takes as arguments the process 
identifiers identifying the processes which 
model the left and right hand descendants of the 
apply node in the combinator graph. It sends a 
token app to the left hand descendant X. The 
reply it receives (which should be a parametrised 
behaviour requiring a process identifier) is 
bound to z. The behaviour then becomes the 
result of supplying the identifier Y as an 
argument to z. Comparing this APPLY with the 
definition of, for example, S0, we see that in
the parallel combination

X:S0 | Y:(APPLY X Z) | Z:...

the APPLY process sends the token app to the S0,
which replies with the message S1. The APPLY then
becomes the behaviour (S1 Z). The transformation
of the process net can be pictured thus:

[Diagram: the net X:S0, Y:(APPLY X Z) becomes the net X:S0, Y:(S1 Z).]


Note that the process X:S0 is still there and is 
ready to respond to another active communication. 
This is necessary because of the possibility of 
sharing. There may be many other APPLY 
processes sending app tokens to X, and the 
process X must reply to all of them. When there 
are no more references to X in the process net 
the X:S0 process is garbage and may be collected.
(Determining when this happens is a non-trivial 
question and is not addressed here). 
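

The way APPLY processes interact with the staged combinator behaviours can be imitated by a sequential simulation. The sketch below is an assumption-laden illustration and not an LNET interpreter: behaviours are plain tuples, the helper names spawn, resolve and step are hypothetical, and the concurrent processes are fired one after another in a loop.

```python
import itertools

_fresh = itertools.count()

def spawn(net, behaviour):
    """Create a process with a fresh run-time name and return that name."""
    name = f"p{next(_fresh)}"
    net[name] = behaviour
    return name

def resolve(net, name):
    """Follow ind(X') behaviours to the process that actually answers."""
    while net[name][0] == "IND":
        name = net[name][1]
    return name

def step(net):
    """Fire every APPLY process whose left neighbour can reply to `app`."""
    fired = False
    for name in list(net):
        behaviour = net[name]
        if behaviour[0] != "APPLY":
            continue
        _, x, y = behaviour
        left = net[resolve(net, x)]
        if left[0] == "S0":
            net[name] = ("S1", y)
        elif left[0] == "S1":
            net[name] = ("S2", left[1], y)
        elif left[0] == "S2":                  # reply (S3 X Y), applied to Z = y
            v = spawn(net, ("APPLY", left[1], y))
            w = spawn(net, ("APPLY", left[2], y))
            net[name] = ("APPLY", v, w)
        elif left[0] == "K0":
            net[name] = ("K1", y)
        elif left[0] == "K1":                  # reply (K2 X), applied to y
            net[name] = ("IND", left[1])
        else:
            continue                           # neighbour cannot reply yet
        fired = True
    return fired

# The graph ((S K) K) 17, which should reduce to the constant 17.
net = {}
s, k, c = spawn(net, ("S0",)), spawn(net, ("K0",)), spawn(net, ("CONST", 17))
inner = spawn(net, ("APPLY", s, k))
middle = spawn(net, ("APPLY", inner, k))
root = spawn(net, ("APPLY", middle, c))
while step(net):
    pass
print(net[resolve(net, root)])
```

The S0 and K0 processes are never modified, which mirrors the remark that a shared combinator process must remain available to every APPLY process that refers to it.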


Constants such as 17 or true are 
represented by processes which repeatedly (because 
of possible sharing) send the constant they hold 
to any other process which asks for it. 


CONST = λx. let p = val?!x . p in p


The behaviour (CONST 17) expects a val token, to
which it replies with the message "17". This val
token will have been sent by a process
representing a basic operator such as +:

+2 = λX.λY. (val!X?a | val!Y?b) . (CONST (a+b))


The + on the right hand side is the "real" one 
that actually does the addition. There are also 
a +0 and +1 defined analogously to S0 and S1.


A proof that this translation correctly 
models combinator graph reduction requires the 
construction of a formal semantics for LNET and a 
mathematical statement of the precise 
correspondence between the reduction of a 
combinator graph and the behaviour of its LNET 
translation. It is beyond the scope of this 
paper. A detailed example of the reduction of a 
simple graph to normal form is presented in [2]. 


The translation we have given models the 
regime of combinator reduction which reduces 
every redex in the graph concurrently. This 
maximises parallelism but is dangerous in the case 
of graphs which, while possessing a normal form,
also allow nonterminating reduction sequences. 
However, other reduction methods, such as lazy 
reduction, can be modelled by choosing other 
translations of the combinators, operators, and 
apply node. 


Conclusion 


The representation of functional programs as 
variable-free combinator graphs allows them to be 
modelled as networks of parallel processes which 
act in concert to perform a distributed 
evaluation of the whole expression.


Abstract


An efficient parallel algorithm to obtain the postfix 
form of an infix arithmetic expression is developed. The 
shared memory model of parallel computing is used. 


Key Words and Phrases: Arithmetic expressions, 
postfix, infix, parallel computing, complexity. 


1. Introduction 


The parallel parsing and evaluation of arithmetic
expressions has been the focus of research for many
years. [1], [2], [9], [11], and [183] are some of the
important papers written on the parallel evaluation of
arithmetic expressions. The most significant result
here is due to Brent [1]. Brent [1] has shown that
arithmetic expressions containing n, n ≥ 1, operands;
operators (+, *, and /); and parentheses can be
evaluated in 4 log₂ n + 10(n-1)/p time when p processors
are available. Parallel parsing of arithmetic expressions
has been considered by Fisher [5], Krohn [8], Lipkie [12],
and Schell [16] (among others). Fisher's work
is restricted to vector (or pipelined) computers. While
Krohn's work was intended primarily for pipelined
computers (specifically for the CDC STAR-100), the ideas
contained in [8] can be extended to parallel
multiprocessor computers. Krohn, however, does not consider
the asymptotic performance that could be obtained
from his parallel parsing algorithm. Lipkie [12] and
Schell [16] explicitly consider parsing on parallel
multiprocessor computers. Lipkie [12] provides some
grammar rules for parallel parsing but does not
develop a formal algorithm. Schell [16] is a thorough
study of parallel techniques for several of the phases
normally encountered in compiling (scanning, syntax
analysis, parsing, error recovery, etc.). Schell
develops a parallel LR parser. The complexity of this
parser is, however, quadratic in the input size (under
some constraints, he shows that its complexity
becomes linear). Schell also discusses the applicability
of his techniques to precedence grammars.


In this paper, we develop a parallel algorithm to 
obtain the postfix form of an arithmetic expression. 
The reader unfamiliar with the postfix form of an 
expression is referred to Horowitz and Sahni [6].


The model of parallel computation that we shall
use here is commonly referred to as the shared
memory model (SMM). Much work has been done on
the design of parallel algorithms using the SMM. The
reader is referred to [3], [4] and the references
contained therein.


While one can talk of obtaining the postfix form for
an entire program, we shall limit our discussion here to
simple expressions. These are permitted to contain
only operands (constants and simple variables),
operators (only the binary operators +, -, *, /, and ↑ are
permitted), and parentheses ('(' and ')').


The parallel algorithm that we shall develop here 
is closely related to the common priority based 
sequential infix to postfix algorithm. We shall make 
explicit reference to the version of this algorithm that 
is presented in [6]. This algorithm utilizes a stack as 
well as a dual priority system. The instack priority
(ISP) of an operator or parenthesis is the priority
associated with the operator or parenthesis when it is
inside the stack. The incoming priority (ICP) is used
when the operator or parenthesis is outside the stack.
For the operator and parenthesis set we are limited
to, the priority assignment of Figure 1 is adequate.


The algorithm of [6] assumes that the infix expression
is in E(1:n) where E(i) is an operator, operand, or
parenthesis, 1 ≤ i ≤ n (in practice, E(i) will be a pointer
into a symbol table). For example, the expression
A+B*C is input as E(1)=A, E(2)=+, E(3)=B, E(4)=*, and
E(5)=C. The postfix form is output in P(1:m), m ≤ n.
For our example, we shall have P(1)=A, P(2)=B, P(3)=C,
P(4)=*, and P(5)=+. The sequential time complexity of
the postfix algorithm of [6] is O(n).


operator/parenthesis       ISP   ICP
)                           -     -
↑, unary +, unary -         3     4
*, /                        2     2
binary +, -                 1     1
(                           0     4


Figure 1. Instack and incoming priorities. 
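

The priority-based sequential conversion referred to above can be sketched as follows. This is a minimal illustration in the spirit of the algorithm of [6], not a transcription of it. The character '^' stands for ↑, tokens are single characters, and the ICP values in the table are the usual textbook ones, taken here as an assumption. An operator is unstacked while the ISP of the stack top is at least the ICP of the incoming token.

```python
# Priorities in the style of Figure 1; '^' stands for the operator written ↑.
ISP = {'^': 3, '*': 2, '/': 2, '+': 1, '-': 1, '(': 0}
ICP = {'^': 4, '*': 2, '/': 2, '+': 1, '-': 1, '(': 4}   # assumed values

def postfix(tokens):
    """Convert a list of infix tokens to a list of postfix tokens."""
    out, stack = [], []
    for tok in tokens:
        if tok == ')':
            while stack[-1] != '(':          # unstack down to the matching '('
                out.append(stack.pop())
            stack.pop()                      # discard the '(' itself
        elif tok in ICP:
            while stack and ISP[stack[-1]] >= ICP[tok]:
                out.append(stack.pop())
            stack.append(tok)
        else:
            out.append(tok)                  # operands go straight to the output
    while stack:
        out.append(stack.pop())
    return out

print(''.join(postfix(list("A+B*C"))))
print(''.join(postfix(list("(A+B)*C"))))
```

Giving ↑ an incoming priority above its instack priority makes it right associative, while equal priorities make the other operators left associative.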


In Section 2, we shall see that the algorithm of [6] 
can be effectively parallelized. 


2. Parallel Generation of the Postfix Form. 


Let the infix expression be given in E(1:n) as described
in Section 1. For every E(i) that is an operator or an
operand, we define a value AFTER(i) such that E(i)
comes immediately after E(AFTER(i)) in the postfix
form of E(1:n). For the first operand in the postfix
form, we define the AFTER value to be zero. Note that
since parentheses do not appear in the postfix form, an
AFTER value for them need not be defined. As an
example, consider the expression (A+B)*C. Its postfix
form is AB+C*. Since E(1:7) = ('(', A, +, B, ')', *, C),
AFTER(1:7) = (-, 0, 4, 2, -, 7, 3).


Our parallel algorithm to obtain the postfix form of
E(1:n) will consist of two phases. In the first, the values
AFTER(1:n) will be computed. In the second phase, the
postfix form will be obtained using these values. In
order to determine AFTER(1:n), we need to first compute
the level L(i) of each token in the expression.
Informally, the level of a token gives the depth of nesting
of parentheses in which this token is contained. So,
if a token is not within any parentheses, its level is 0.
More formally, the level, L, is defined by the algorithm
of Figure 2.


step 1:  \[ G(i) \leftarrow \begin{cases} 1 & \text{if } E(i) = \text{'('} \\ -1 & \text{if } E(i) = \text{')'} \\ 0 & \text{otherwise} \end{cases}, \quad 1 \le i \le n \]

step 2:  \[ L(i) \leftarrow \sum_{j=1}^{i} G(j), \quad 1 \le i \le n \]

step 3:  \[ L(i) \leftarrow L(i) + 1 \ \text{ if } E(i) = \text{')'}, \quad 1 \le i \le n \]


Figure 2 Computation of L. 
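

The three steps of Figure 2 translate directly into a few lines of sequential code. The sketch below uses 0-based Python lists and single-character tokens, both of which are choices made for the example; the prefix sum of step 2 is computed here with a library routine instead of a parallel algorithm.

```python
from itertools import accumulate

def levels(tokens):
    """Return the level L of every token, following the three steps of Figure 2."""
    g = [1 if t == '(' else -1 if t == ')' else 0 for t in tokens]    # step 1
    sums = list(accumulate(g))                                        # step 2
    return [l + 1 if t == ')' else l for t, l in zip(tokens, sums)]   # step 3

print(levels(list("(A+B)*C")))
```

Step 3 is what gives a right parenthesis the same level as the tokens it encloses; a left parenthesis already has that level after step 2.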


In Figure 3, we give an example arithmetic expression
together with the L() values associated with each
token (row 3).


Let us sequence through procedure POSTFIX of [6]
as it works on the example expression of Figure 3. The
variable i points to the token in E that is currently
being examined. When i=1, E(1)='(' and '(' gets put
onto the stack. Next, i=2, and E(2)=A is placed into the
postfix form. When i=5, the postfix form has
P(1:2)=(A,B) and the stack has the form -∞, (, *. During
this iteration, * is unstacked (as ISP(*) ≥ ICP(E(5))).
We shall say that E(3) gets unstacked by E(5). E(5)
gets added to the stack and on the next iteration,
E(6)=C is placed in the postfix form. When i=18, the
stack has the form -∞, +, ↑, (, -, *, ↑, ↑ and
P(1:9)=(A,B,*,C,D,E,F,G,H). During this iteration,
E(16)=↑, E(14)=↑, E(12)=*, and E(10)=- get unstacked
(in that order). I.e., E(16), E(14), E(12), and E(10) get
unstacked by E(18). Furthermore, E(10) is the last
operator to get unstacked by E(18).


For each i such that E(i) is an operator, we may
define U(i) to be the index in E of the operator or
parenthesis that causes E(i) to get unstacked. In case
E(i) gets unstacked after the entire expression has
been seen, then U(i) = n+1. For our example, U(3) = 5,
U(10) = U(12) = U(14) = U(16) = 18. Also, for each i
such that E(i) is either an operator or a right
parenthesis, we may define LU(i) to be the index of the
last operator that gets unstacked by E(i). If no operator
is unstacked by E(i), then LU(i) is set to 0. For our
example, LU(3)=0, LU(5)=3, LU(7)=LU(10)=LU(12) =
LU(14)=LU(16)=0, and LU(18)=10.


Continuing with our example, we see that when
i=19, P(1:13)=(A,B,*,C,D,E,F,G,H,↑,↑,*,-), and the stack
has the form -∞, +, ↑. At this time, E(7)=↑ is unstacked
and E(19)=* is stacked. So, LU(19)=7 and U(7)=19.
Rows 6 and 7 of Figure 3 give the U and LU values for
all the operators and parentheses of our example.
Note that U is defined only for operators and LU only
for operators and right parentheses.


An examination of procedure POSTFIX [6] and our
definition of the level L of a token reveals that if E(i) is
an operator, then:


U(i) = least j, j > i such that ISP(E(i)) ≥ ICP(E(j))
and L(i)=L(j). If there is no j satisfying this
requirement, then U(i)=n+1.


[Figure 3: the example expression, tabulated by token index with rows that include L, U, LU, AFTER, and position in P. The table is not recoverable from the source.]


From the definition of U, it follows that if E(i) is an 
operator or a right parenthesis, then LU(i) is given by: 


LU(i) = least j, j <i such that U(j)=i. If there is no 
j with U(j)=i, then LU(i)=0. 
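

Read literally, the two definitions give a direct (quadratic, sequential) way to compute U and LU, which is useful as a reference when checking a parallel version. The sketch below makes several assumptions that are not in the text: '^' stands for ↑, tokens are single characters, only operators and right parentheses are considered as candidates j, and a right parenthesis is given incoming priority 0 so that it unstacks every operator at its level. Lists are 1-based with index 0 unused.

```python
from itertools import accumulate

ISP = {'^': 3, '*': 2, '/': 2, '+': 1, '-': 1}
ICP = {'^': 4, '*': 2, '/': 2, '+': 1, '-': 1, ')': 0}   # ICP(')') = 0 is assumed

def levels(tokens):
    g = [1 if t == '(' else -1 if t == ')' else 0 for t in tokens]
    return [l + 1 if t == ')' else l for t, l in zip(tokens, accumulate(g))]

def u_and_lu(tokens):
    """Return 1-based lists U and LU; None marks an undefined entry."""
    n = len(tokens)
    E = [None] + list(tokens)
    L = [None] + levels(tokens)
    U = [None] * (n + 1)
    LU = [None] * (n + 1)
    for i in range(1, n + 1):
        if E[i] in ISP:                       # U is defined for operators only
            U[i] = next((j for j in range(i + 1, n + 1)
                         if E[j] in ICP and L[j] == L[i]
                         and ISP[E[i]] >= ICP[E[j]]), n + 1)
    for i in range(1, n + 1):
        if E[i] in ICP:                       # operators and right parentheses
            LU[i] = next((j for j in range(1, i) if U[j] == i), 0)
    return U, LU

U, LU = u_and_lu(list("(A+B)*C"))
print(U[1:], LU[1:])
```

For (A+B)*C the + is unstacked by the right parenthesis, so U(3)=5 and LU(5)=3, while * is only unstacked at the end of the input, giving U(6)=n+1=8.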


Before proceeding to determine AFTER, it is useful 
to eliminate extraneous right parentheses. An extrane- 
ous right parenthesis is formally defined to be one for 
which the LU value is 0. Extraneous right parentheses 
together with their matching left parentheses serve no 
useful function but may be present in E nonetheless. 
Examples of occurrences of such parentheses are: (A), 
((A+B))*C, and (((A+B))) (extraneous right parentheses 
have been underlined). 


The elimination of extraneous right parentheses
may be accomplished in the following way. Define
C(1:n) as below:


\[ C(i) = \begin{cases} 0 & \text{if } E(i) = \text{')'} \text{ and } LU(i) = 0 \\ 1 & \text{otherwise} \end{cases} \]


Let S(i) be the sum \( \sum_{j=1}^{i} C(j) \), 1 ≤ i ≤ n. S(i) gives the
number of tokens in E(1:i) that are not extraneous
right parentheses. The replacement:


\[ (E(S(i)),\ U(S(i)),\ LU(S(i))) \leftarrow (E(i),\ U(i),\ LU(i)) \]


carried out for all i such that E(i) is not an extraneous
right parenthesis results in the elimination of all
extraneous right parentheses from E.


As an example, consider the expression: 


(((A+B+C))+D)*(((E))) 


The extraneous right parentheses are underlined. Fol- 
lowing the elimination of these parentheses, the 
expression E takes the form: 


(((A+B+C)+D)*(((E 


As we shall see below, following the determination
of the levels L(1:n), the left parentheses serve no useful
function in our algorithm. Hence, these could be
eliminated along with the elimination of the extraneous
right parentheses. To accomplish this, we need only
define C(1:n) as:


\[ C(i) = \begin{cases} 0 & \text{if } E(i) = \text{'('} \text{ or } (E(i) = \text{')'} \text{ and } LU(i) = 0) \\ 1 & \text{otherwise} \end{cases} \]


and proceed as before.


Once the extraneous right parentheses have been
eliminated, AFTER may be computed as described
below. In the following discussion of the computation
of AFTER, we assume that n has been updated to the
value S(n) defined above.
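

The elimination step is a compaction driven by the prefix sums S. A minimal sketch, assuming 0-based Python lists and an LU list supplied by the caller (None where LU is undefined), is given below. Only E is moved here, whereas the text moves U and LU along with it; the function name eliminate and the drop_left_parens flag are hypothetical.

```python
from itertools import accumulate

def eliminate(E, LU, drop_left_parens=False):
    """Compact E, dropping every right parenthesis whose LU value is 0."""
    def keep(i):
        if E[i] == ')' and LU[i] == 0:
            return 0
        if drop_left_parens and E[i] == '(':
            return 0
        return 1
    C = [keep(i) for i in range(len(E))]
    S = list(accumulate(C))              # S[i] = number of kept tokens in E[0..i]
    out = [None] * (S[-1] if S else 0)
    for i in range(len(E)):              # each i is independent of the others
        if C[i]:
            out[S[i] - 1] = E[i]
    return out

E = list("((A+B))*C")
LU = [None, None, None, 0, None, 4, 0, 0, None]   # hand-computed for this example
print(''.join(eliminate(E, LU)))
print(''.join(eliminate(E, LU, drop_left_parens=True)))
```

Since every kept token writes to a distinct slot S(i), the loop body can be executed for all i at once without write conflicts.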


case 1: E(i) is an operand.


In this case, we determine the largest j, j < i such
that E(j) is either an operand or LU(j) is defined and
greater than 0 (note that as extraneous parenthesis
pairs are not permitted, if E(j)=')' then LU(j) > 0).
Such a j does not exist iff E(i) is the first operand in the
expression. From procedure POSTFIX and our
definition of LU, it follows that


\[ AFTER(i) = \begin{cases} 0 & \text{if no } j \text{ as above exists} \\ j & \text{if } E(j) \text{ is an operand} \\ LU(j) & \text{otherwise} \end{cases} \]


case 2: E(i) is an operator.


In this case, we see that if there exists a j such
that j > i and U(j)=U(i), then AFTER(i) is the smallest j
with this property. So, in our example expression,
U(10) = U(12) = U(14) = U(16) = 18. Also, in P, E(10)
comes immediately after E(12) which comes immediately
after E(14). E(14) comes immediately after
E(16).


For E(16), however, there is no j, j > 16 and U(j) =
U(16). For operators with this property, there are two
possibilities: either E(U(i)-1) is an operand or E(U(i)-1) is a
right parenthesis. If E(U(i)-1) is an operand, then E(U(i)-1)
is the token placed in P just before the unstacking
caused by E(U(i)) begins. Hence, AFTER(i) = U(i)-1. If
E(U(i)-1) is a right parenthesis, then this right
parenthesis would have caused at least one operator to
get unstacked (by assumption, extraneous parenthesis
pairs are not permitted). Hence, LU(U(i)-1) ≠ 0 and
E(LU(U(i)-1)) is the operator that immediately precedes
E(i) in P. So, we get:


j ← least j, j > i and U(j)=U(i)


\[ AFTER(i) = \begin{cases} j & \text{if } j \text{ is defined} \\ U(i)-1 & \text{if } j \text{ is undefined and } E(U(i)-1) \text{ is an operand} \\ LU(U(i)-1) & \text{if } j \text{ is undefined and } E(U(i)-1) = \text{')'} \end{cases} \]
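

The two cases can be checked against the small example (A+B)*C with a direct sequential sketch. It assumes 1-based lists with index 0 unused, None for undefined entries, single-character tokens with '^' for ↑, and U and LU values supplied by the caller; the function name after_values is hypothetical.

```python
OPERATORS = set('+-*/^')
PARENS = {'(', ')'}

def is_operand(tok):
    return tok not in OPERATORS and tok not in PARENS

def after_values(E, U, LU):
    """Return the 1-based AFTER list for tokens E with given U and LU."""
    n = len(E) - 1
    AFTER = [None] * (n + 1)
    for i in range(1, n + 1):
        if E[i] in OPERATORS:                                   # case 2
            j = next((j for j in range(i + 1, n + 1) if U[j] == U[i]), None)
            if j is not None:
                AFTER[i] = j
            elif E[U[i] - 1] == ')':
                AFTER[i] = LU[U[i] - 1]
            else:
                AFTER[i] = U[i] - 1
        elif is_operand(E[i]):                                  # case 1
            j = next((j for j in range(i - 1, 0, -1)
                      if is_operand(E[j])
                      or (LU[j] is not None and LU[j] > 0)), None)
            if j is None:
                AFTER[i] = 0
            elif is_operand(E[j]):
                AFTER[i] = j
            else:
                AFTER[i] = LU[j]
    return AFTER

E = [None, '(', 'A', '+', 'B', ')', '*', 'C']
U = [None, None, None, 5, None, None, 8, None]
LU = [None, None, None, 0, None, 3, 0, None]
print(after_values(E, U, LU)[1:])
```

The result agrees with the values AFTER(1:7) = (-, 0, 4, 2, -, 7, 3) quoted earlier for this expression.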


Row 8 of Figure 3 gives the AFTER values for all the
operators and operands in our example expression.
The AFTER values link the E(i)s in the order they should
appear in the postfix form. This linked list is shown
explicitly in Figure 4. From this linked list, we wish to
determine the position, POS, of each operator and
operand in the postfix form. For one of the operands,
i.e., the one with AFTER(i)=0, this position is already
known (it goes into P(1)). With each E(i), let us associate
a one bit field K(i). K(i)=0 iff the position of E(i) in
P has not been determined. Initially, K(i)=0 for all
but one of the tokens (i.e. the one with AFTER(i)=0).


For any node, i, in the linked list defined by the
AFTER values, POS(i) is one more than the number of
nodes preceding it in that list (the node with AFTER
value 0 is the first node in the list; so the list is linked
backwards). The POS values may be obtained by
recursively splitting this linked list. The first time the list is
split, we get two lists (A and B) consisting of alternating
elements from the original list. The POS value of the
first element in list A is already known and that for the
first element of list B is now known to be 2. Figure 5(a)
shows the resulting lists when we start with the lists of
Figure 4. The lists A and B are again split. When the
list A of Figure 5(a) is split, we get the lists A1 and A2 of
Figure 5(b). At this time, the POS value for the first
node of list A2 becomes known, i.e., 3. Each time a list
is split, we get two lists of about half the length. So,
following ⌈log n⌉ splits, all lists will be of size 1 and all the
POS values will be known. The formal algorithm to
determine POS is given in Figure 6.


step 1 //initialize//
    case
      :AFTER(i) is undefined: K(i) ← undefined
      :AFTER(i)=0: K(i) ← 1; POS(i) ← 1
      :else: K(i) ← 0
    end case


step 2 //split lists and compute POS//
    for v ← 1 to ⌈log n⌉ do
      if K(i)=0 then j ← AFTER(i)
                     AFTER(i) ← AFTER(j)
                     if K(j)=1 then
                        K(i) ← 1
                        POS(i) ← POS(j)+2^(v-1)
                     endif
      endif
    endfor


Figure 6 Algorithm to compute POS. 
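

The list-splitting scheme of Figure 6 can be simulated sequentially by taking a snapshot of AFTER, K and POS at the start of each round, so that every node reads the values of the previous round, as synchronous PEs would. The sketch below assumes 1-based lists with index 0 unused and None for undefined entries; the function name positions is hypothetical.

```python
def positions(AFTER):
    """Return the 1-based POS list computed by repeated list splitting."""
    n = len(AFTER) - 1
    after = list(AFTER)
    K = [None] * (n + 1)
    POS = [None] * (n + 1)
    for i in range(1, n + 1):                      # step 1: initialize
        if after[i] is not None:
            K[i] = 1 if after[i] == 0 else 0
            if after[i] == 0:
                POS[i] = 1
    rounds = (n - 1).bit_length()                  # ceil(log2 n) for n >= 1
    for v in range(1, rounds + 1):                 # step 2: split and compute POS
        old_after, old_K, old_POS = list(after), list(K), list(POS)
        for i in range(1, n + 1):                  # conceptually all i in parallel
            if old_K[i] == 0:
                j = old_after[i]
                after[i] = old_after[j]
                if old_K[j] == 1:
                    K[i] = 1
                    POS[i] = old_POS[j] + 2 ** (v - 1)
    return POS

E = [None, '(', 'A', '+', 'B', ')', '*', 'C']
POS = positions([None, None, 0, 4, 2, None, 7, 3])
P = [None] * sum(p is not None for p in POS)
for i, p in enumerate(POS):
    if p is not None:
        P[p - 1] = E[i]                            # P(POS(i)) <- E(i)
print(''.join(P))
```

In round v each unresolved node points 2^(v-1) nodes back in the list, which is why the offset added to POS(j) doubles from round to round.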


The correctness of the algorithm of Figure 6 can
be established formally by providing a proof by induction
on the length of the initial linked list. We omit this
proof here.


Once the POS values have been computed as
described above, the postfix form P is obtained by
executing the following instruction:


if AFTER(i) is defined then P(POS(i)) ← E(i)


[Figure 5: (a) Splitting the list of Figure 4. (b) Splitting the list A of (a).]


Complexity Analysis


First, let us consider the computation of the levels
L (Figure 2). Step 1 can be done in O(1) time using n
PEs (each PE is assigned to compute a different G(i)).
It can also be done in O(log n) time using n/log n PEs
(each PE sequentially computes log n of the G()s). The
L(i)s of step 2 may be computed in O(log n) time using
n/log n PEs and the partial sums algorithm of [4].
Finally, step 3 can be performed in O(log n) time using
n/log n PEs. Hence, the levels L() may be obtained in
O(log n) time using n/log n PEs.
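

To see why partial sums need only O(log n) parallel time, the following sketch simulates the simple doubling scheme, in which every element adds the value found a fixed distance to its left and the distance doubles each round. This is not the n/log n PE algorithm of [4]; it is a plainer variant that would use n PEs, shown only to illustrate the logarithmic number of rounds.

```python
def prefix_sums(values):
    """Inclusive prefix sums computed in about log2(n) synchronous rounds."""
    sums = list(values)
    step = 1
    while step < len(sums):
        previous = list(sums)                 # every PE reads the old values
        for i in range(step, len(sums)):      # conceptually all i in parallel
            sums[i] = previous[i] + previous[i - step]
        step *= 2
    return sums

# G values for (A+B)*C as produced by step 1 of Figure 2.
print(prefix_sums([1, 0, 0, 0, -1, 0, 0]))
```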


Next, consider the computation of U and LU. One
possibility is to use mp PEs to first make m copies of
each of the p operators and right parentheses in E (m
is the number of operators in E). This takes O(log m)
time. Note that O(log n) time is needed to avoid read
conflicts. Each operator now has a copy of the operators
and right parentheses in E for itself. Each operator
E(i) is assigned p PEs to work with. These are first
used to eliminate operators and right parentheses E(j)
with j < i. Next, the level and ISP of E(i) are transmitted
to the remaining operators and right parentheses. This
takes O(log p) time (again having no read conflicts)
with p PEs. Operators and right parentheses with a
different level number or with ICP > ISP(E(i)) are
eliminated. The operators and right parentheses not yet
eliminated are candidates for U(i). The one with least j
can be determined in O(log p) time using a binary tree
comparison scheme and p PEs. If there are no candidates,
U(i)=n+1. LU may now be determined in a similar
manner. This strategy to compute U and LU takes
O(n²) PEs and O(log n) time. Using the techniques of
[4], it can be made to run in O(log n) time using only
O(n²/log n) PEs.


An alternative strategy is to first collect together
all operators and right parentheses that have the same
level number. This can be done in O(log² n) time using n
PEs as follows. First, each left parenthesis determines
the position of its matching right parenthesis. This is
done by simply sorting the left and right parentheses
by their level number. If a stable sort is used, each left
parenthesis will be adjacent to its matching right
parenthesis following the sort (Figure 7). The sort can
be accomplished in O(log² n) time using n PEs [15].
Now, each left parenthesis can determine the address,
M(i), of its matching right parenthesis.


[Figure 7: illustration not recoverable from the source.]


Once M(i) has been determined for each left
parenthesis E(i), we can link together all operators and
right parentheses with the same level as needed in the
computation of U. There are only two possibilities for
any operator i. These are:


(a) E(i+1)='(': In this case, E(i) is linked to M(i+1)+1.
(b) E(i+1) ≠ '(': In this case i+2=n+1 or E(i+2) is an
operator. Regardless, E(i) is linked to i+2.


Performing this linkage operation on the example
of Figure 3 gives the linked lists of Figure 8. Now, each
linked list can be treated independently. For operators
with the highest ISP (i.e., ↑), the U value is
obtained by collapsing together consecutive chains of ↑
so that all ↑ point to the nearest non-↑. The U value
equals the link value. So, U(7) = 19, U(14) = U(16) =
18. For operators with the next highest ISP, the U
values are obtained by removing all nodes representing
the operator ↑. The link values give the U value. Doing
this on the lists of Figure 7 yields the lists of Figure 9.
So, U(3)=5, U(19)=21, U(12)=18, U(28)=30. Now, by
eliminating all nodes that represent * and / and
collapsing the lists we can determine the U value for the
next ISP class. We obtain U(5)=21, U(21)=32,
U(10)=18, and U(25)=27. Each elimination and collapsing
operation above can be performed in O(log n) time
using n PEs and the strategy used in Figure 6 to
compute POS. Since the number of ISP classes is a
constant, the time needed to determine U is O(log n).
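One standard way to collapse a chain in a logarithmic number of synchronous rounds is pointer jumping. The sketch below uses that technique as a stand-in; the strategy of Figure 6 is not reproduced in this section, so whether it coincides with pointer jumping is an assumption. The names `collapse`, `link`, and `skip` are hypothetical.

```python
def collapse(link, skip):
    """Make every node point to the nearest following node not in `skip`.

    link: {node: next node}; a target that is not a key marks the list end.
    skip: set of nodes to be bypassed (for example, all exponentiation nodes).
    Each round reads only the previous round's pointers, mimicking one
    synchronous step on n PEs; chains of length c need about log2(c) rounds.
    """
    ptr = dict(link)
    changed = True
    while changed:
        changed = False
        nxt = {}
        for v, t in ptr.items():
            if t in skip and t in ptr:
                nxt[v] = ptr[t]   # jump over the skipped node
                changed = True
            else:
                nxt[v] = t
        ptr = nxt
    return ptr


print(collapse({1: 2, 2: 3, 3: 4, 4: 5}, skip={2, 3, 4}))
```

Here nodes 2, 3 and 4 form a chain to be bypassed, and every pointer ends at node 5 after two effective rounds.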


It should be evident that LU can be computed
during the computation of U. Each operator and right
parenthesis keeps track of the farthest operator it
unstacks from each ISP class. In comparing the two
strategies to obtain U and LU, we note that the first
strategy takes O(log n) time but requires O(n*/log n)
PEs while the second strategy takes $O(\log^2 n)$ time and
requires only n PEs. So, the log n speed-up of the first
strategy over the second is obtained through a
considerable increase in the number of processors used.


The extraneous right parentheses can be
eliminated in O(log n) time using n/log n PEs. The initial
values of AFTER() may now be computed. First, each
operand determines the nearest (on its left) binary
operator, right parenthesis, and operand. These are
shown in Figure 9 for our example of Figure 3. Zeroes
indicate the absence of a nearest quantity on the left.
These three sets of nearest values can be determined
in O(log n) time using n PEs. For example, to get the
nearest operands, we eliminate all E(i)s that are not an
operand. The remaining E(i)s are concentrated to the
left. This enables each operand to determine its
nearest left operand. Next, the operands are
distributed back to their original spots (see [14] for an
O(log n) distribution algorithm).
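The three nearest-left quantities can be checked with a simple left-to-right scan. This is a sequential sketch, not the concentrate-and-distribute method described above; `nearest_left` is a hypothetical name, operands are assumed to be single characters, and `^` is used for ↑.

```python
def nearest_left(expr, operators="+-*/^"):
    """For each operand position (1-indexed) return a triple
    (nearest binary operator, nearest right parenthesis, nearest operand)
    on its left; 0 means there is none."""
    last_op = last_rp = last_operand = 0
    result = {}
    for i, ch in enumerate(expr, start=1):
        if ch in operators:
            last_op = i
        elif ch == ')':
            last_rp = i
        elif ch != '(':
            result[i] = (last_op, last_rp, last_operand)
            last_operand = i
    return result


# The first eleven symbols of the running example: ( A * B + C ^ ( D - E
for pos, triple in nearest_left("(A*B+C^(D-E").items():
    print(pos, triple)
```

For this prefix the scan gives 0, 3, 5, 7, 10 for the nearest binary operator and 0, 2, 4, 6, 9 for the nearest operand, the same values as the tabulated example.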


If E(i) is an operand and has no nearest operand
on the left, AFTER(i)=0. If the nearest binary operator
(on the left) has an LU value greater than 0, then AFTER(i)
equals this LU value. If E(i) has a nearest right parenthesis
(on the left) then AFTER(i) is the LU value of this parenthesis.
Otherwise, AFTER(i) is the location of the nearest
operand on the left.


If E(i) is an operator, we can determine the
smallest j, j > i, such that U(j)=U(i) during the computation
of U and LU. So, if such a j exists, AFTER has already
been computed. If no such j exists, AFTER(i) is to be
set to either U(i)-1 or LU(U(i)-1). Both these quantities
are already known. So, the computation of AFTER for
operators takes O(1) additional time.


E                           (   A   *   B   +   C   ↑   (   D   -   E
Nearest binary operator         0       3       5           7       10
Nearest right parenthesis       0       0       0           0       0
Nearest operand                 0       2       4           6       9

The computation of POS (Figure 6) requires only
O(log n) time and n PEs. The formation of P takes O(1)
time and n PEs. Hence, using n PEs, the postfix form
may be computed in $O(\log^2 n)$ time (the second strategy
to compute U and LU must be used as only n PEs are
available). The complexity is dominated by the sort
step. Another complexity measure worth computing is
the EPU (effectiveness of processor utilization). This is
the ratio of the complexity of the fastest known
sequential algorithm and the product of the complexity
of the parallel algorithm and the number of processors
used by this algorithm. For our parallel postfix
algorithm, we have:


Note also that by using n® PEs and the first strategy to 
compute U and LU, the postfix form can be computed 
in O(log n) time. The EPU of the resulting algorithm is


1 
OX togn ), 


3. Conclusions 


We have shown that it is possible to effectively
parallelize the postfix algorithm given in [6]. Our parallel
algorithm runs in $O(\log^2 n)$ time when n PEs are available. If
only n/k PEs are available, our algorithm can still be
used. The complexity will be $O(k \log^2 n)$.


The results of this paper nicely complement the 
work reported on the parallel evaluation of expressions 
(see [1], [2], [9], [11], and [13]).


A Parallel Matching Algorithm for Convex Bipartite Graphs*

Eliezer Dekel+ and Sartaj Sahni
University of Minnesota 


Abstract 


An efficient parallel algorithm to obtain maximum 
matchings in convex bipartite graphs is obtained. 


Key Words and Phrases: Parallel algorithm, convex
bipartite graph, scheduling, complexity.


1. Introduction 

A convex bipartite graph G is a triple (A,B,E).
$A = \{a_1,a_2,\ldots,a_n\}$ and $B = \{b_1,b_2,\ldots,b_m\}$ are
disjoint sets of vertices. E is the edge set. E satisfies the
following properties:

(1) If (i,j) is an edge of E, then either $i \in A$ and $j \in B$ or
    $i \in B$ and $j \in A$; i.e., no edge joins two vertices in A or
    two in B.

(2) If $(a_i,b_j) \in E$ and $(a_i,b_{j+k}) \in E$, then
    $(a_i,b_{j+q}) \in E$, $1 \le q < k$.

Property (1) above is the bipartite property while
property (2) is the convexity property. An example convex
bipartite graph is shown in Figure 1.1.


Figure 1.1 A convex bipartite graph.


$F \subseteq E$ is a matching in the convex bipartite graph
G=(A,B,E) iff no two edges in F have a common endpoint.
$F_1 = \{(a_1,b_2),(a_4,b_3),(a_5,b_1)\}$ is a matching in the graph
of Figure 1.1 while $F_2 = \{(a_1,b_1),(a_1,b_2),(a_2,b_3)\}$ is not. F is
a maximum cardinality matching (or simply a maximum
matching) in G iff (a) F is a matching and (b) G contains
no matching H such that $|H| > |F|$ ($|H|$ = number of edges in H).
The matching depicted by solid lines in Figure 1.1 is a
maximum matching in that graph.
In what follows, we shall find it convenient to have an
alternate representation of convex bipartite graphs. It is
clear that every convex bipartite graph G=(A,B,E),
$A=\{a_1,\ldots,a_n\}$, $B=\{b_1,\ldots,b_m\}$, is uniquely represented
by the set of triples:

$T = \{(i,s_i,h_i) \mid 1 \le i \le n\}$
$s_i = \min\{j \mid (a_i,b_j) \in E\}$
$h_i = \max\{j \mid (a_i,b_j) \in E\}$

In the triple representation, we explicitly record the
smallest ($s_i$) and highest ($h_i$) index vertices to which
each $a_i$ is connected. For the example of Figure 1.1, we
have T = {(1,1,2), (2,3,3), (3,0,0), (4,3,3), (5,1,3)}.
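As a small illustration of the triple representation, the sketch below derives T from an explicit edge list. The function name `to_triples` is hypothetical, and the edge list is reconstructed by expanding the intervals of the triples quoted above (it is not read off Figure 1.1 directly). A vertex with no edges is recorded as (i,0,0), as in the text's example.

```python
def to_triples(n, edges):
    """edges: iterable of (i, j) meaning a_i is joined to b_j.
    Returns the list of triples (i, s_i, h_i) for i = 1..n."""
    adjacent = {i: [] for i in range(1, n + 1)}
    for i, j in edges:
        adjacent[i].append(j)
    triples = []
    for i in range(1, n + 1):
        if adjacent[i]:
            triples.append((i, min(adjacent[i]), max(adjacent[i])))
        else:
            triples.append((i, 0, 0))  # isolated vertex, as for a_3
    return triples


edges = [(1, 1), (1, 2), (2, 3), (4, 3), (5, 1), (5, 2), (5, 3)]
print(to_triples(5, edges))
```

Because of convexity, the pair (s_i, h_i) determines the whole neighbourhood of a_i, so no information is lost.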


As an example of the use of matchings in convex
bipartite graphs, consider the problem of scheduling n
unit time jobs to minimize the number of tardy jobs. In
this problem, we are given a set J of n jobs. Job i has a
release time $r_i$ and a due time $d_i$. It requires one unit of
processing. We assume that $r_i$ and $d_i$ are natural
numbers. A subset F of J is a feasible subset iff every job
in F can be scheduled on one machine in such a way that
no job is scheduled before its release time or after its
due time. A feasible subset F is a maximum feasible
set iff there is no feasible subset MAXM of J such that
$|MAXM| > |F|$.

A maximum feasible subset F can be found by
transforming the problem into a maximum matching
problem on a convex bipartite graph. Without loss of
generality, we may assume that $\min\{r_i\}=0$; $r_i<d_i$, $1 \le i \le n$;
and $\max\{d_i\} \le n$. The convex bipartite graph
corresponding to J is given by the triple set
$T = \{(i,s_i,h_i) \mid s_i=r_i,\ h_i=d_i-1\}$. Figure 1.2 shows an
example job set and the corresponding convex bipartite graph G.
Vertex i of A represents job i while vertex i in B simply
represents the time slot [i,i+1]. There is an edge from job i to
time slot [j,j+1] iff $r_i \le j < d_i$. Hence, every matching in G
represents a feasible subset of J. Also, corresponding to
every feasible subset of J there is a matching in G.
Clearly, a maximum cardinality feasible subset of J can
be easily obtained from a maximum matching of G. In
addition, a maximum matching also provides the time
slots in which the jobs should be scheduled.
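The transformation from jobs to triples is mechanical, as the following sketch shows. The job data are made up for illustration (they are not the job set of Figure 1.2), and `jobs_to_triples` and `slots` are hypothetical names.

```python
def jobs_to_triples(jobs):
    """jobs: list of (r_i, d_i) pairs, job i being the i-th entry (1-indexed).
    Returns triples (i, s_i, h_i) with s_i = r_i and h_i = d_i - 1."""
    return [(i, r, d - 1) for i, (r, d) in enumerate(jobs, start=1)]


def slots(triple):
    """Time slots j (meaning the interval [j, j+1]) usable by the job."""
    _, s, h = triple
    return list(range(s, h + 1))


jobs = [(0, 3), (2, 4), (1, 2)]          # hypothetical (release, due) pairs
triples = jobs_to_triples(jobs)
print(triples)
print([slots(t) for t in triples])
```

Each job's usable slots form a contiguous range, which is exactly the convexity property.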


Glover [5] has obtained a rather simple algorithm to
find a maximum matching in a convex bipartite graph
G=(A,B,E). Let $h_i = \max\{j \mid (a_i,b_j) \in E\}$, $1 \le i \le |A|$.
Glover's algorithm considers the vertices in B one by one starting
at $b_1$. We first determine the set R of remaining vertices
in A to which the vertex $b_j$ currently being considered is
connected. Let q be such that $a_q \in R$ and
$h_q = \min\{h_i \mid a_i \in R\}$. Vertex $b_j$ is matched to $a_q$ and
$a_q$ is deleted from the graph. The next vertex in B is now
considered. Observe that Glover's algorithm is essentially the
same as that suggested by Jackson [6].

Figure 1.2 (example job set, with columns $r_i = s_i$ and
$d_i = h_i + 1$, and the corresponding convex bipartite graph)
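A sequential sketch of Glover's rule on the triple representation follows. It uses a binary heap keyed on h (not the van Emde Boas queues or the parallel method of this paper), assumes B is indexed 1..m, and breaks ties in h by the smaller vertex index, which is a choice made for this illustration. The name `glover_matching` is hypothetical.

```python
import heapq


def glover_matching(triples, m):
    """triples: (i, s_i, h_i) for the vertices of A; B = {1, ..., m}.
    Returns a list of matched pairs (i, j), a_i matched to b_j."""
    by_start = sorted((s, h, i) for i, s, h in triples)
    heap, k, matching = [], 0, []
    for j in range(1, m + 1):
        # Vertices whose interval has started become candidates for b_j.
        while k < len(by_start) and by_start[k][0] <= j:
            _, h, i = by_start[k]
            heapq.heappush(heap, (h, i))
            k += 1
        # Vertices whose interval ended before j can no longer be matched.
        while heap and heap[0][0] < j:
            heapq.heappop(heap)
        if heap:
            _, i = heapq.heappop(heap)   # remaining vertex with least h
            matching.append((i, j))
    return matching


T = [(1, 1, 2), (2, 3, 3), (3, 0, 0), (4, 3, 3), (5, 1, 3)]
print(glover_matching(T, 3))
```

On the triples of Figure 1.1 this matches all three vertices of B, so the result is a maximum matching.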


A straightforward implementation of Glover's
algorithm has complexity O(mn). When m is O(log log n), a
more efficient implementation results from the use of
the fast priority queues of van Emde Boas ([4] and [8]).
The resulting implementation has complexity
O(m + n log log n). The fastest sequential algorithm known
for the matching problem is due to Lipski and Preparata
[8]. It differs from Glover's algorithm in that it examines
the vertices of A one by one rather than those of B. This
algorithm has complexity O(n + mA(m)) where A(.) is the
inverse of Ackermann's function and is a very slowly
growing function.


In Section 2, we obtain a parallel algorithm for max- 
imum matchings in convex bipartite graphs. Our 
analysis of this algorithm will assume the availability of 
as many PEs as needed. This is in keeping with much of 
the research done on parallel algorithms. In practice, of 
course, only k processors (for some fixed k) will be avail- 
able. Our analyses are easily modified for this case. It 
will be apparent that if our algorithm has time complex- 
ity t(n) using g(n) PEs, then with k PEs (k<g(n)), its com- 
plexity will be t(n)g(n) /k. 


The parallel computer model used is the shared 
memory model (SMM). This is an example of a single 
instruction stream multiple data stream (SIMD) com- 
puter. In a SMM computer, there are p processing ele- 
ments (PEs). Each PE is capable of performing the stan- 
dard arithmetic and logical operations. The PEs are 
indexed 0,1,...,p-1 and an individual PE may be referenced
as in PE(i). Each PE knows its index and has some
local memory. In addition, there is a global memory to 
which every PE has access. The PEs are synchronized 
and operate under the control of a single instruction 
stream. An enable/disable mask may be used to select a
subset of the PEs that are to perform an instruction. 
Only the enabled PEs will perform the instruction. Dis- 
abled PEs remain idle. All enabled PEs execute the same 
instruction. The set of enabled PEs may change from 
instruction to instruction. 
If two PEs attempt to simultaneously read the same
word of the shared memory, a read conflict occurs. If
two PEs attempt to simultaneously write into the same
word of the shared memory, a write conflict occurs.
Throughout this paper, we shall assume that read and 
write conflicts are prohibited. 


The reader is referred to [2] for a list of references 
dealing with graph algorithms, matrix algorithms, sort- 
ing, scheduling, etc. on a SMM computer. 


2. Parallel Matching In Convex Bipartite Graphs 


In Section 1, we showed that every instance of the
problem of scheduling jobs to minimize the number of tardy
jobs could be transformed into an equivalent instance of
the maximum matching in a convex bipartite graph
problem. It should be evident that the reverse is also
true. Hence, the two problems are isomorphic. A
parallel algorithm for a special case of the job scheduling
formulation was obtained by us in [1]. In this special
case, it was assumed that all jobs have the same release
time. This corresponds to the case when the convex
bipartite graphs are of the form
$T=\{(i,s_i,h_i) \mid 1 \le i \le n\}$ and $s_i=c$, $1 \le i \le n$, for some c.


We shall now proceed to show how the solution for 
the special case described above can be used to solve 
the general case when the $r_i$ are not all the same. This
will be done using the binary tree method described by 
Dekel and Sahni [2]. Rather than specify the new algo- 
rithm formally, we shall describe how it works by means 
of an example. 


A convex bipartite graph is shown in Figure 2.1. For
this graph, |A|=14 and |B|=13. The $s_i$ and $h_i$ values
associated with each vertex of A are given in the first two
columns of this figure. The first step in our parallel
algorithm for maximum matching is to sort the vertices in A
in nondecreasing order of $s_i$. Vertices with the same $s_i$
are sorted into nondecreasing order of $h_i$. For our
example, the result of this reordering is shown in Figure
2.2.


Following the sort, we identify the distinct $s_i$ values.
Let these be $R_1,R_2,\ldots,R_k$. Assume that $R_1<R_2<\cdots<R_k$.
Let $R_{k+1}=\max\{h_i\}+1$. For our example, k=4 and
R(1:k+1) = (1, 4, 9, 12, 14).
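The preprocessing step (sort by s, then h, and extract the boundary array R) can be sketched in a few lines. The triples below are hypothetical; they were chosen only so that the boundaries come out as (1, 4, 9, 12, 14), since the data of Figure 2.1 are not reproduced in this section. `leaf_boundaries` is a hypothetical name.

```python
def leaf_boundaries(triples):
    """triples: (i, s_i, h_i). Returns (vertices sorted by (s, h), R) where
    R[0..k-1] are the distinct s values and R[k] = max h + 1."""
    ordered = sorted(triples, key=lambda t: (t[1], t[2]))
    R = sorted({s for _, s, _ in triples})
    R.append(max(h for _, _, h in triples) + 1)
    return ordered, R


sample = [(1, 4, 6), (2, 1, 4), (3, 1, 3), (4, 12, 13), (5, 9, 11)]
ordered, R = leaf_boundaries(sample)
print(ordered)
print(R)
```

Consecutive entries of R delimit the B-intervals that the leaves of the computation tree will own.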


We are now ready to use the binary tree method of
[2]. The underlying computation tree we shall use is the
unique complete binary tree with k leaf nodes. Figure
2.3 shows the complete binary trees with 4, 5, and 6 leaf
nodes. For our example, k=4 and we use the tree of
Figure 2.3(a). With each node, P, in the computation tree,
we associate a contiguous subset {u,u+1,u+2,...,v} of the
vertices in B. This subset is denoted [u,v].P or simply
[u,v].

Figure 2.3: Complete binary trees.


Let the leaf nodes of the computation tree be
numbered 1 through k, left to right. If P is the ith leaf node,
then [u,v].P is $[R_i,R_{i+1}-1]$ (i.e., $u=R_i$ and $v=R_{i+1}-1$). If P
is not a leaf node, then the subset of B associated with P
is [u,v].LC(P) ∪ [u,v].RC(P) where LC(P) and RC(P) are,
respectively, the left and right children of P. The
subsets of B associated with each of the vertices in the
computation tree for our example are shown in Figure
2.4. The number in each node of this tree is its index.


                      [1,13]

          [1,8]                   [9,13]

     [1,3]     [4,8]        [9,11]     [12,13]


Figure 2.4
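The intervals of the computation tree can be generated from R as sketched below. Heap-style node numbering (root 1, children 2P and 2P+1) is an assumption of this sketch; it agrees with the text's use of node 2 for [1,8] and node 4 for [1,3]. `build_tree` is a hypothetical name.

```python
def build_tree(R):
    """R: boundary array of length k+1 (R[0] = R_1, ..., R[k] = R_{k+1}).
    Returns {node index: (u, v)} for the complete binary tree with k leaves,
    numbered as a heap with 2k-1 nodes (nodes k..2k-1 are leaves)."""
    k = len(R) - 1
    intervals = {}
    leaf_numbers = iter(range(1, k + 1))   # leaves in left-to-right order

    def visit(p):
        if p >= k:                          # leaf node
            i = next(leaf_numbers)
            intervals[p] = (R[i - 1], R[i] - 1)
        else:                               # union of the children's intervals
            visit(2 * p)
            visit(2 * p + 1)
            intervals[p] = (intervals[2 * p][0], intervals[2 * p + 1][1])

    visit(1)
    return intervals


for node, uv in sorted(build_tree([1, 4, 9, 12, 14]).items()):
    print(node, uv)
```

For R = (1, 4, 9, 12, 14) this yields the seven intervals of the example tree, with [1,13] at the root.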


Let P be any vertex of the computation tree. Let
[u,v] be the subset of B associated with P. The subset of
A available for matching at node P is denoted M(P) and
is defined to be:

$M(P)=\{i \mid u \le s_i \le v\}$

For example,

M(1)={1,2,...,14};
M(2)={1,2,3,4,5,6,7,12};
M(4)={2,3,4,7};
etc.


The subset M(P) of A vertices available for matching
at P may be partitioned into three subsets MAXM(P), I(P),
and T(P). MAXM(P) is a maximum cardinality subset of
M(P) that may be matched with vertices in [u,v].P by
algorithm MATCH; this subset is called the matched set.
I(P) denotes the infeasible set. It consists of all vertices
$i \in M(P)-MAXM(P)$ such that $h_i \le v$. The transferred set T(P)
consists of all vertices $i \in M(P)-MAXM(P)$ such that $h_i > v$.


Consider node 2 of Figure 2.4. The matching
problem defined at this node is given in Figure 2.5. Note that
$h'_i = \min\{v, h_i\}$. A' is the set M(2) and B' is [u,v].2. If
Glover's algorithm is used on this graph, then
{1,2,3,4,5,7,12} defines a subset of A' that can be
matched with vertices in B'. Further, this gives a
maximum matching. Hence, MAXM(2)={1,2,3,4,5,7,12};
I(2)={6}; and T(2)=∅. Observe that |MAXM(1)| is the size of a
maximum matching in the original convex bipartite
graph. Also, T(1)=∅ and I(1)=A-MAXM(1).

Figure 2.5


We shall make two passes over the computation 
tree. The first pass begins at the leaves and moves 
towards the root. During this pass, the MAXM, I, and T 
sets for each node are computed. The second pass 
starts at the root and progresses towards the leaves. In 
this pass, the MAXM set for each node is updated so as to 
correspond to the set of A vertices matched by Glover's 
algorithm to the B vertices associated with that node. 


Pass 1 


In this pass, we make extensive use of the parallel
algorithm developed in [1] for the case when all the $s_i$ are
the same. For our purposes here, it is sufficient to know 
the sequential algorithm (FEAS of [1]) that this parallel 
algorithm is based on. This sequential algorithm is given 
in Figure 2.6. For convenience, this has been translated 
into the graph language used here. The parallel version 
of this algorithm has complexity O(logn) and uses n/logn 
PEs [1]. 


line  procedure FEAS(n,u,v)
      //Find a maximum matching of vertices in
        A onto the B set [u,v]. For every vertex i∈A,
        s_i = u.//
1     global MAT(1:n); set A; integer n,u,v,i,j
2     sort A into nondecreasing order of h_i
3     MAT(1:n) ← 0 //initialize//
4     j ← u
5     for i ← 1 to n do
6       case
7         :j > v: return(j) //all vertices in B matched//
8         :j ≤ h_i: //select i// j ← j+1; MAT(i) ← 1
9       end case
10    end for
11    return(j)
12    end FEAS

Figure 2.6
An examination of Glover's algorithm reveals that it
performs exactly as does procedure FEAS when the
restrictions and simplifications applicable to FEAS are
incorporated into it.


Hence, for a leaf node of the computation tree, the
MAXM set may be obtained by a direct application of
procedure FEAS (or its parallel equivalent). For example,
for node 4 of Figure 2.4, we have A = M(4) = {2,3,4,7};
$h_2=4$; $h_3=3$; $h_4=4$; $h_7=2$; u=1; and v=3. Using FEAS
(observe that this algorithm yields the same results
regardless of whether the $h_i$ values or the modified
values $h'_i$ of Figure 2.5 are used), we obtain
MAXM(4)={7,3,2}. Note that MAXM consists of exactly
those vertices i with MAT(i)=1. I() consists of exactly
those vertices i with MAT(i)=0 and $h_i \le v$. The remaining
vertices form T(). The matched set MAXM, transferred
set T, and infeasible set I for each of the leaves in our
example are shown in Figure 2.7. Null sets are not
explicitly shown. So, for node 4, I(4)=∅; T(4)={4}; and
MAXM(4)={7,3,2}. The sets are ordered by $h_i$.
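The sequential procedure FEAS, together with the split of the unmatched vertices into I and T, can be written as a short Python sketch and checked on the node 4 example. The function name `feas`, its return convention, and the tie-break on vertex index are choices made for this illustration, not part of the original procedure.

```python
def feas(vertices, h, u, v):
    """Sequential FEAS on B-interval [u, v]; every vertex is assumed to
    have s_i <= u. Returns (MAXM, I, T), each ordered by h."""
    order = sorted(vertices, key=lambda i: (h[i], i))  # line 2 of FEAS
    matched = []
    j = u
    for i in order:
        if j > v:            # all vertices of [u, v] are used
            break
        if j <= h[i]:        # select i for slot j
            matched.append(i)
            j += 1
    chosen = set(matched)
    unmatched = [i for i in order if i not in chosen]
    infeasible = [i for i in unmatched if h[i] <= v]
    transferred = [i for i in unmatched if h[i] > v]
    return matched, infeasible, transferred


h = {2: 4, 3: 3, 4: 4, 7: 2}
print(feas([2, 3, 4, 7], h, u=1, v=3))
```

With the values given for node 4, the sketch returns MAXM = [7, 3, 2], an empty infeasible set, and T = [4], as stated in the text.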


For a nonleaf node P, the MAXM, I, and T sets may be
obtained by using the MAXM, I, and T sets of the children
of P. Let L and R, respectively, be the left and right
children of P. To determine MAXM(P), we first use procedure
FEAS with $u=u_R$ and $v=v_R$ ($[u_R,v_R]$ is the subset of B
associated with the right child R of P). The A set consists
of T(L) ∪ MAXM(R). Since both T(L) and MAXM(R) are
already sorted by $h_i$, the sort of line 2 of FEAS can be
replaced by a merge. Let S be the subset of T(L) ∪
MAXM(R) that has MAT()=1 following the execution of
FEAS. The following theorem establishes that MAXM(L) ∪
S is a maximum cardinality subset of M(P) that may be
matched with vertices in [u,v].P. Hence,
MAXM(P)=MAXM(L) ∪ S. Following the determination of
S, MAXM(L) and S are merged to obtain MAXM(P) in
nondecreasing order of $h_i$.
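The combine step at a nonleaf node can be sketched sequentially as below. The data are a made-up instance with left interval [1,2] and right interval [3,6] (not the paper's Figure 2.1), and `combine` is a hypothetical name; `heapq.merge` plays the role of the merge that replaces line 2 of FEAS.

```python
from heapq import merge


def combine(maxm_L, t_L, maxm_R, h, u_R, v_R):
    """Inputs are lists already in nondecreasing order of h.
    Returns (MAXM(P), S) where S is selected from T(L) and MAXM(R)."""
    candidates = list(merge(t_L, maxm_R, key=lambda i: h[i]))
    S, j = [], u_R
    for i in candidates:          # FEAS without the sort
        if j > v_R:
            break
        if j <= h[i]:
            S.append(i)
            j += 1
    maxm_P = list(merge(maxm_L, S, key=lambda i: h[i]))
    return maxm_P, S


# Hypothetical instance: vertices 1-3 start in [1,2], vertices 4-5 in [3,6].
h = {1: 1, 2: 2, 3: 5, 4: 3, 5: 4}
print(combine(maxm_L=[1, 2], t_L=[3], maxm_R=[4, 5], h=h, u_R=3, v_R=6))
```

Vertex 3 could not be placed in the left interval but its h value reaches into the right one, so it is transferred and picked up there.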


Theorem 2.1: MAXM(L) ∪ S as defined above is a
maximum cardinality subset of M(P) that may be matched
with vertices in [u,v].P using algorithm MATCH.


Proof: The proof is by induction on the height of the 
subtree of which P is the root (A tree consisting of only a 
root has height 0). If this height is 1, then MAXM(L) and 
MAXM(R) are maximum cardinality subsets of M(L) and 
M(R) that can, respectively, be matched by Glover's
algorithm with vertices in [u,v].L and [u,v].R. If this height
is greater than 1, then MAXM(L) and MAXM(R) satisfy this
maximum cardinality matching requirement by
induction.


As far as node P is itself concerned, we see that only
vertices in M(L) are candidates for matching with
vertices in $[u_L,v_L]$ (recall that for vertices in M(R), the $s_i$
value exceeds $v_L$). Furthermore, when Glover's
algorithm is used with the A set being M(P) and the B set
being $[u_L,v_R]$ = [u,v].P, vertices in B are considered in
the order $u_L, u_L+1, \ldots, v_L, u_R, \ldots, v_R$. Hence, MAXM(L) is
precisely the subset of M(P) that gets matched with
vertices in $[u_L,v_L]$.


The candidates for the remaining vertices in B, i.e.,
$[u_R,v_R]$, are clearly T(L) ∪ M(R). From the way Glover's
algorithm works, it is also clear that the vertices of T(L)
∪ M(R) that will get matched to $[u_R,v_R]$ are a subset of
T(L) ∪ MAXM(R). Let this subset be S'. We wish to show
that S is a legitimate choice for S'. First, we show that S
represents a feasible matching. Then, we shall show that
S is in fact selectable by Glover's algorithm.


We know that MAXM(R) can be matched into $[u_R,v_R]$.
Let Z be any such matching. Since S is selected by FEAS,
we know that every vertex in S can be paired with a
distinct vertex in $[u_R,v_R]$ in such a way that no vertex j in
S is paired with a vertex with index greater than $h_j$.
Consider a pairing W that meets this condition. Now suppose
that some vertex j in S is paired with a vertex q in
$[u_R,v_R]$ with index less than $s_j$. Clearly, j must be a
member of MAXM(R) (as all vertices in T(L) have an s
value less than $u_R$). Suppose that j is matched to $j_1$ in Z.
So, $q<j_1$. If $j_1$ is free in W, then the pairing of j in W may
be changed from q to $j_1$. If $j_1$ is not free, then suppose it
is matched to $j_2$. From the restriction on W, it follows
that $q<j_1 \le h_{j_2}$. If $j_2 \in T(L)$, then $j_2$ may be paired with q
and j with $j_1$ (since $q<j_1$, $s_{j_2} \le q \le h_{j_2}$). If $j_2 \in MAXM(R)$, then
suppose that $j_2$ is matched to $j_3$ in Z. It is easy to see
that $j_3 \ne j_1$. If $q=j_3$ or $j_3$ is free in W, then we may pair j
with $j_1$ and $j_2$ with $j_3$. If q is in the interval $[s_{j_2}, h_{j_2}]$,
then we may pair j with $j_1$ and $j_2$ with q. If q is not in this
interval, then since $q<h_{j_2}$, $q<s_{j_2} \le j_3$. Note that the
condition $q<j_{2i+1}$ is preserved. This is needed in case
$j_{2i+2} \in T(L)$. Now, let $j_4$ be the vertex paired with $j_3$ in W.
It should be clear that we can continue in this way and
modify W so that j is paired with $j_1$, $j_2$ with $j_3$, $j_4$ with $j_5$,
etc. In the new pairing, there is one fewer vertex of S
that is paired with a vertex with smaller s value.

Figure 2.7 Results of first pass.


Repeating the above construction several times, W
can be transformed into a matching such that every
vertex $j \in S$ is matched to a vertex q in $[u_R,v_R]$ such that
$s_j \le q \le h_j$. Hence, S represents a feasible matching.


Let S' (as defined earlier) be the subset of MAXM(R)
∪ T(L) matched by Glover's algorithm to the vertex set
$[u_R,v_R]$. We shall now proceed to show that S is a valid
choice for S'. Let Z be any matching of MAXM(R) into
$[u_R,v_R]$. Let Y be a matching of S' in which all vertices in
MAXM(R) ∩ S' are matched to the same vertex in $[u_R,v_R]$
as in the matching Z. Let W be a corresponding
matching for S. The existence of the matchings Y and W is a
consequence of the construction used to show the
feasibility of S.


From the definition of S', it follows that $|S'| \ge |S|$. Also,
from the working of FEAS, it follows that $S \not\subset S'$. Let $j \in S$
be a vertex with least $h_j$ such that $j \notin S'$. If no such j
exists, then S=S'. Assume that j is matched to q in W. If
q is free in Y, then S' cannot be of maximum cardinality.
So, let $p \in S'$ be matched to q in Y. By definition of Y and
W, $p \notin S$. Also, from the order in which FEAS considers
vertices, $h_p \ge h_j$ (as otherwise, FEAS would consider p
before j and select p for S). Hence, $S' \cup \{j\} - \{p\}$ is also a
subset selectable by Glover's algorithm (since $h_p \ge h_j$, by
ensuring j<p, Glover's algorithm will be forced to match j
before p). $S' \cup \{j\} - \{p\}$ agrees with S in one place more
than does S'.


By repeating this interchange process, S' may be
transformed into S with the result that S is also a
maximum cardinality subset of MAXM(R) ∪ T(L) that is
selectable by Glover's algorithm for matching in $[u_R,v_R]$.

Hence, MAXM(L) ∪ S is a maximum cardinality
subset of M(P) selectable by Glover's algorithm for matching
in [u,v].P. □


Once MAXM(P) is known, T(P) and I(P) are easily
computed. Actually, as I(P) is never used, we may omit
its computation. Figure 2.7 shows the MAXM, I, and T
sets (except when empty) for all nodes in our example.


Pass 2 


In the second pass, for each vertex P of the computation
tree, we compute a set MAXM'(P) which represents the
set of A vertices matched by Glover's algorithm with the
set [u,v].P. With respect to the matching shown by solid
lines in Figure 2.1, we see that if P is the root, then
MAXM'(P) = {1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 14}; if P is
node 3 of the computation tree, then MAXM'(P) = {8, 9,
10, 11, 14}.
If P is the root node, then MAXM'(P)=MAXM(P), by
definition of MAXM(P). Let P be any nonleaf node for
which MAXM'(P) has been computed. Let L and R be the
left and right children, respectively, of P. Let [u,v].L =
$[u_L,v_L]$ and [u,v].R = $[u_R,v_R]$. Let
$V = \{j \mid j \in MAXM'(P)$ and $s_j < u_L\}$. Let W be the ordered set
obtained by merging together V and MAXM(L) (note that both V
and MAXM(L) can be maintained so that they are in nondecreasing
order of $h_i$ and that W is also in nondecreasing order of
$h_i$). MAXM'(L) consists of the first $\min\{|W|, v_L-u_L+1\}$
vertices in W. The correctness of this statement may be
established by induction on the level of P. MAXM'(R) is
readily seen to be MAXM'(P)-MAXM'(L). Figure 2.8 shows
the MAXM'( ) sets for all the vertices in the computation
tree of our example.
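One downward step of the second pass can be sketched as follows and tried on node 2 of the running example. The s values used here (1 for vertices 2, 3, 4, 7 and 4 for vertices 1, 5, 12) are inferred from the M() sets and the boundary array quoted in the text; only the h values of vertices 2, 3, and 7 are needed. `split_down` is a hypothetical name.

```python
from heapq import merge


def split_down(maxm_prime_P, maxm_L, s, h, u_L, v_L):
    """Split MAXM'(P) into (MAXM'(L), MAXM'(R)).
    maxm_prime_P and maxm_L are lists in nondecreasing order of h."""
    V = [j for j in maxm_prime_P if s[j] < u_L]
    W = list(merge(V, maxm_L, key=lambda j: h[j]))
    left = W[:min(len(W), v_L - u_L + 1)]
    taken = set(left)
    right = [j for j in maxm_prime_P if j not in taken]
    return left, right


s = {1: 4, 2: 1, 3: 1, 4: 1, 5: 4, 7: 1, 12: 4}   # inferred, see lead-in
h = {2: 4, 3: 3, 7: 2}                             # values given for node 4
left, right = split_down([7, 3, 2, 4, 1, 5, 12], [7, 3, 2], s, h, u_L=1, v_L=3)
print(left, right)
```

The left child receives [7, 3, 2] and the right child [4, 1, 5, 12], which agrees with the pairs (7,1), (3,2), (2,3) and (4,4), (1,5), (5,6), (12,7) of the final matching.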


From the MAXM'( ) sets of the leaves, the matching
is easily obtained. If P is a leaf, and [u,v].P=[a,b], then
the first vertex in MAXM'(P) is matched with a, the
second with a+1, etc. (note that MAXM'(P) is in
nondecreasing order of $h_i$). The matching for our example is
also given in Figure 2.8.
matching = {(7,1), (3,2), (2,3), (4,4), (1,5), (5,6), (12,7), (9,9), (8,10),
(11,11), (14,12), (10,13)}


Figure 2.8 Second pass




Complexity Analysis 

The initial ordering of A by $s_i$ and within $s_i$ by $h_i$ can be
done in $O(\log^2 n)$ time using n/2 PEs ([11] and [12]).
During the first pass, the computation of MAXM() requires
the use of FEAS and a merge. The use of FEAS (without
the sort) takes O(log n) time and requires
$O(|M(P)|/\log|M(P)|)$ PEs [1]. The merge at node P takes
O(log n) time with |M(P)|/2 PEs. Since MAXM can be
computed in parallel for all nodes on the same level of the
computation tree, O(log n) time is needed per level. The
total time for the first pass is $O(\log^2 n)$ and n/2 PEs are
needed. Pass 2 requires only some merging per node.
The total cost of this pass is also $O(\log^2 n)$ and n/2 PEs
suffice.


Hence, the overall complexity of our parallel
algorithm for maximum matching in convex bipartite graphs
is $O(\log^2 n)$. The PE requirement is O(n).


Another complexity measure often computed for 
parallel algorithms is the effectiveness of processor utili- 
zation (EPU) (see [1], [2], and [14]). For any problem P 
and parallel algorithm A, this is the ratio of the complex- 
ity of the fastest known algorithm for P and the product 
of the complexity of A and the number of PEs used by A. 


For our algorithm, we have an EPU that is
$O((n+mA(m))/(n\log^2 n))$ (recall that m=|B|).


3. Conclusions 


This paper has further enhanced the utility of the binary 
tree method of Dekel and Sahni [2] for the design of 
parallel algorithms. It should also be pointed out that 
while all of our complexity analyses have assumed the 
availability of as many PEs as needed, our algorithms 
can be used when fewer PEs are available. The complexity
of each algorithm will increase by no more than the
shortfall in PEs. So if only half the number of PEs is avail- 
able, then the time needed will at most double (except 
for a possible constant increase in overhead). 


The parallel matching algorithm developed here can 
be used to obtain efficient parallel algorithms for several 
scheduling algorithms. These algorithms are developed 
in [15]. 
Abstract (1)


This experiment has determined an 
optimum problem solving strategy for the 
consistent labeling problem. One combi- 
nation of factors, depth first search 
strategy-transmit large problems-transmit 
50% of a processor's work, was found to
be statistically best, especially for 
large problem sizes or for architectures 
with restricted communications paths. 
Future work involves experimentation to 
understand the architecture related fac- 
tors. The results in this paper indicate 
that the performance of the system, even 
using the optimum problem solving stra- 
tegy, will vary considerably with archi- 
tecture. 


I. INTRODUCTION 

Combinatorial problem solving under- 
lies numerous important problems in areas 
such as operations research, non-parametric
statistics, graph theory, computer
science, and artificial intelligence. 
Examples of specific combinatorial prob- 
lems include, but are not limited to, 
various resource allocation problems, the 
travelling salesman problem, the relation 
homomorphism problem, the graph clique 
problem, the graph vertex cover problem,
the graph independent set problem, the 
consistent labeling problem, and proposi- 
tional logic problems [12-15]. These 
problems have the common feature that all 
known algorithms to solve them take, in 
the worst case, exponential time as prob- 
lem size increases. They belong to the 
problem class NP. 


This paper describes the interaction 
between specific algorithm parameters and 
the parallel computer architecture. The 
classes of architectures we consider are 
those which have inherent distributed 
control and whose connection structure is 
regular. 


(1)This work was supported in part by the 
Office of Naval Research Grant 
N0O0014-80-C-0689. 
Combinatorial problems require solutions
which do searching. To help in
describing the parallel combinatorial
search, we associate with the space yet
to be searched the term "the current
problem." A representation mechanism
which can partition the space yet to be
searched can divide the current problem
into mutually exclusive subproblems.

Now suppose that one processor in a
parallel computer is given a combinatorial
problem. In order to get other processors
involved, the processor divides
the problem into mutually exclusive
subproblems and gives one subproblem to each
of the neighboring processors, keeping
one subproblem itself. At any moment in
time each of the processors in the parallel
computer network may be busy solving
a subproblem or may be idle after having
finished the subproblem on which it was
working. At suitable occasions in the
processing, a busy processor may notice
that one of its neighbors is idle. On
such an occasion the busy processor
divides its current problem into two
subproblems, hands one off to the idle
neighbor and keeps one itself.


The key points of this description are 


1. the capability of problem division 

2. the ability of every processor to
solve the entire problem alone, if it 
had to. 

3. the ability of a busy processor to 
transfer a subproblem to an idle 
neighbor. 
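The division-and-handoff scheme can be illustrated with a toy simulation. Everything specific in it is an assumption made for illustration: processors sit on a ring, the "problem" is enumerating the leaves of a complete binary search tree, each busy processor expands one node per step depth first, and a busy processor hands the older half of its pending work to an idle neighbour. It is not the experimental system described in this paper.

```python
def simulate(num_procs, depth):
    """Return (leaves found, synchronous steps used)."""
    stacks = [[] for _ in range(num_procs)]
    stacks[0].append(0)            # one processor is given the whole problem
    leaves = steps = 0
    while any(stacks):
        steps += 1
        # Each busy processor expands one subproblem, depth first.
        for st in stacks:
            if st:
                level = st.pop()
                if level == depth:
                    leaves += 1
                else:
                    st.extend([level + 1, level + 1])
        # A busy processor notices an idle ring neighbour and divides its work.
        for p in range(num_procs):
            for q in ((p - 1) % num_procs, (p + 1) % num_procs):
                if len(stacks[p]) >= 2 and not stacks[q]:
                    half = len(stacks[p]) // 2
                    stacks[q], stacks[p] = stacks[p][:half], stacks[p][half:]
    return leaves, steps


print(simulate(1, 5))
print(simulate(4, 5))
```

The number of leaves found is the same for any processor count, since every subproblem is expanded exactly once; only the number of steps changes with the arrangement of processors.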

The parallel computer architecture
research issue is: to determine that way
of problem subdivision which maximizes
computation efficiency for each way of
arranging a given number of processors
and their bus communication links.

To define this research issue
precisely requires

1. that we have a systematic parametric
   way of describing processor/bus
   arrangements and

2. that we have alternative problem
   subdivision techniques.


This paper addresses the interaction
between the processor/bus graph and
problem size subdivision.
Once these relationships are determined
and expressed mathematically, the parallel
computer architecture design problem
becomes less of an art and more of a
mathematical optimization.

Our ultimate goal is to allow computer 
engineers to begin with the combinatorial 
problems of interest and determine via a
mathematical optimization, the optimal
parallel computer architecture to solve 
the problems assuming that the associated 
combinatorial algorithms are given. 


II. PROCESSOR-BUS MODEL


In this section we discuss a
processor-bus model which can be used to model
all known regular parallel architectures
[1,3,4,7,8,10,21-26]. The model does not
currently include the general
interconnection and shuffle type networks.

The graphical basis for the model is a connected regular bipartite
graph. A graph is bipartite if its nodes can be partitioned into two
disjoint subsets in such a way that all edges connect a node in one
subset with a node in the second subset. A graph is connected if
there is a path between every pair of nodes in the graph. A bipartite
graph is regular if every node in the first set has the same degree
and every node in the second set has the same degree. One subset of
nodes represents the processor nodes and one subset represents the
communication nodes in the parallel processing system. Every edge in
the graph then connects a processing node to a communication node. 


Any regular bipartite graph can be used to design a parallel computer
structure by assigning the nodes in one set to be processors and the
nodes in the other set to be communication links (or buses). Notice
that theoretically either set of the bipartite graph could be the
processor set. Therefore, each unlabeled bipartite graph represents
two distinctly different computer architectures depending upon which
set is considered to be processors and which set is considered to be
the buses. 


The notation $B(n_p, d_p, n_c, d_c)$ will be used to denote a regular
bipartite graph which represents an architecture with $n_p$
processors (each connected to $d_p$ communication nodes) and $n_c$
communication nodes (each servicing $d_c$ processors). 


transfer mechanism. 


ful because each processor 
Boolean 3-cube will then be represented by a graph B(8,3,12,2). In
general, the Boolean n-cube will be represented by a graph
$B(2^n, n, n2^{n-1}, 2)$. Reversing the assignment of nodes to
processors and buses produces the B(12,2,8,3) graph which is called
the p-cube by some investigators. 
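
The cube parameters quoted above can be checked by building the graph explicitly. The following is only an illustrative sketch, not code from the paper: the function names `boolean_cube_buses` and `bipartite_parameters` are hypothetical, and each bus is assumed to join exactly the two processors whose addresses differ in one bit.

```python
from itertools import product


def boolean_cube_buses(n):
    """Return {bus_id: (proc_a, proc_b)} for the Boolean n-cube.

    Processors are the 2**n bit tuples. Each bus joins two processors
    whose addresses differ in exactly one bit (an assumption of this sketch).
    """
    buses = {}
    for p in product((0, 1), repeat=n):
        for i in range(n):
            if p[i] == 0:  # create each bus once, from its 0-end
                q = p[:i] + (1,) + p[i + 1:]
                buses[(p, i)] = (p, q)
    return buses


def bipartite_parameters(buses):
    """Return (n_p, d_p, n_c, d_c); raise ValueError if the graph is not regular."""
    degree = {}
    for ends in buses.values():
        for proc in ends:
            degree[proc] = degree.get(proc, 0) + 1
    proc_degrees = set(degree.values())
    bus_degrees = {len(ends) for ends in buses.values()}
    if len(proc_degrees) != 1 or len(bus_degrees) != 1:
        raise ValueError("graph is not regular")
    return (len(degree), proc_degrees.pop(), len(buses), bus_degrees.pop())


print(bipartite_parameters(boolean_cube_buses(3)))  # prints (8, 3, 12, 2)
```

For n = 6 the same functions give (64, 6, 192, 2), the 6-cube used in the simulation experiments.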


Other common architectures also have representations as bipartite
graphs. For example, a planar array of size $x^2$ connected in the
von Neumann manner is modeled as a $B(x^2, 4, 2x^2, 2)$ graph, the
Moore connection results in a $B(x^2, 8, 4x^2, 2)$ graph, the common
bus architecture (or star) with x processors is a B(x,1,1,x) graph,
and the common ring architecture is a B(x,2,x,2) graph. All existing
architectures with regular local neighborhood interconnections can
be modeled as a $B(n_p, d_p, n_c, d_c)$ graph. 
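
Every such parameter set must satisfy the edge-count identity $n_p d_p = n_c d_c$, since each edge of the bipartite graph has one processor end and one bus end. The sketch below tabulates the architectures named above and checks that identity. The function names are hypothetical, and the planar arrays are assumed to wrap around at the borders so that every processor has the full degree.

```python
def von_neumann_array(x):
    """x-by-x array, 4 neighbors per processor (wrap-around assumed)."""
    return (x * x, 4, 2 * x * x, 2)


def moore_array(x):
    """x-by-x array, 8 neighbors per processor (wrap-around assumed)."""
    return (x * x, 8, 4 * x * x, 2)


def common_bus(x):
    """x processors sharing a single bus (star)."""
    return (x, 1, 1, x)


def ring(x):
    """x processors, each linked to its two ring neighbors."""
    return (x, 2, x, 2)


def edge_counts_agree(params):
    """Check n_p * d_p == n_c * d_c for a B(n_p, d_p, n_c, d_c) tuple."""
    n_p, d_p, n_c, d_c = params
    return n_p * d_p == n_c * d_c


for build in (von_neumann_array, moore_array, common_bus, ring):
    params = build(8)
    print(build.__name__, params, edge_counts_agree(params))
```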


III. PROBLEM SOLVING FACTORS 


Introduction to Tree Searching 


In order to make effective use of a multiple asynchronous processor
for any problem, a major concern is how to distribute the work among
the processors with a minimum of interprocessor communication. Kung
[14] defines module granularity as the maximal amount of
computational time a module can process without having to
communicate. Large module granularity is better because it reduces
the contention for the buses and reduces the amount of time a
processor is either idle or sending or receiving work. Also, large
granularity is usually better because of the typically fixed
overhead associated with the synchronization of the multiple
processors. 


In the combinatorial tree search problems we are considering, module
granularity as defined by Kung is not as meaningful because each
processor could in fact solve the entire problem by itself without
communicating to anybody. For our problem a more appropriate
definition of module granularity might be the expected amount of
processing time or the minimum amount of processing time before a
processor splits its problem into two subproblems, one of which is
given to an idle neighboring processor and one of which is kept
itself. 


When a processor has finished searching that portion of the tree
required to solve its subproblem, it must wait for new work to be
transferred from another processor. The amount of time a processor
must wait before transmission begins and until transmission is
completed is time wasted in the parallel environment that would not
be lost in a single processor system. Thus, one must expect
improvement in the time to completion to solve a problem in the
multiple processor environment to be less than proportional to the
number of processors. The factors that can affect the performance by
either reducing the average transmission time or reducing the
required number of transmissions include choice of algorithm, choice
of search strategy, and choice of subproblems that busy processors
transfer to idle processors. 


Choice of Algorithm 


In the single processor case, various algorithms have been proposed
and studied to efficiently solve problems requiring tree searches.
These usually involve investing an additional amount of computation
at one node in the tree in order to prune the tree early and avoid
needless backtracking. In work on constraint satisfaction [11], the
forward checking pruning algorithm was found to perform the best of
the six tested and backtracking the worst. 


For the same reasons, it seems clear that pruning the tree early
should be carried over to a multiple processor system to reduce the
amount of computation necessary to solve the problem. There are
other reasons as well. Failure to prune the tree early may later
result in transfers to idle processors of problems which will be
very quickly completed. Since a transfer ties up, to some extent,
both the sending and receiving processor, time is lost doing the
communication and the processor receiving the problem would shortly
become idle. We would, therefore, expect that in the multiple
processor environment the forward checking pruning algorithm for
constraint satisfaction would work much better than backtracking.
However, in the uniprocessor environment Haralick and Elliott also
showed that too much look ahead computation at a node in the search
could actually increase the problem completion time. It is not clear
that this would be true in the multiple processor case. It may be
best to do more testing early, reducing future transfers,
communication overhead, and delay, in contrast to the single
processor case where only some extra testing has been found to be
worthwhile. 


A second consideration in the selection of a search algorithm is the
amount of information that must be transferred to an idle processor
to specify a subproblem and any associated lookahead information
already obtained pertinent to the subproblem. In most cases this is
proportional (or inversely proportional) to the complexity of the
problem remaining to be solved. Thus the transmission time will be a
function of the problem complexity. Backtracking requires very
little information to be passed while, for forward checking, a table
of labels yet to be eliminated must be sent. 


Search Strategy 


Search strategy is a second factor of importance to the multiple
processor environment. When a problem involves finding all
solutions, like the consistent labeling problem, the entire tree
must be searched. Thus, in a uniprocessor system the particular
order in which the search is conducted, i.e., depth first or breadth
first, has no effect. In a multiple processor system, however, this
is a critical factor because it directly affects the complexity of
the problems remaining in the tree to be solved and available to be
sent to idle processors from busy processors. 


A depth first search will leave high complexity problems to be
solved later (that is, problems near the root of the tree). This
would seem to be desirable in the multiple processor environment
because passing such a problem to an idle processor would increase
the length of time the processor could work before going idle and
thereby reduce the need for communication. On the other hand, a
breadth first search would tend to produce problems of approximately
the same size. Since the problem is not completed until all
processors are finished, the breadth first strategy might be
preferable if it results in all processors finishing at about the
same time. It might be that the best approach could be some
combination of the two; for example, one might follow a depth first
strategy for a certain number of levels, then go breadth first to a
certain depth, and then continue depth first again. 
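
The observation that a depth first search leaves its largest pending subproblems near the root can be illustrated with an explicit stack. This is a minimal sketch under stated assumptions, not the simulation described in this paper: the tree is uniform and unpruned, a node is represented by the tuple of choices made so far, and the names `children` and `depth_first_with_donation` are hypothetical.

```python
def children(node, depth_limit, branching):
    """Child subproblems of a node; a node is the tuple of choices made so far."""
    if len(node) == depth_limit:
        return []
    return [node + (b,) for b in range(branching)]


def depth_first_with_donation(depth_limit, branching, donate_after):
    """Search depth first; after `donate_after` nodes, give away one subproblem.

    Returns (nodes searched locally, donated subproblem or None).
    """
    stack = [()]            # the root is the empty tuple
    searched = 0
    donated = None
    while stack:
        node = stack.pop()  # deepest pending node first
        searched += 1
        stack.extend(children(node, depth_limit, branching))
        if donated is None and searched == donate_after and stack:
            # In a uniform tree the shallowest pending node roots the
            # largest remaining subtree, so it keeps a neighbor busy longest.
            donated = min(stack, key=len)
            stack.remove(donated)
    return searched, donated


print(depth_first_with_donation(depth_limit=3, branching=2, donate_after=3))
# prints (8, (0,))
```

In this binary tree of 15 nodes, the donated subproblem `(0,)` is a child of the root and accounts for 7 nodes, leaving 8 for the donor.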


Problem Passing Strategy 


A factor closely related to the search 
strategy occurs when a processor has a 
number of problems of various complexi- 
ties to send to an idle processor. The 
optimization question is how many should 
be sent and of what complexity(ies). 
Further complicating this is a situation 
where the processor is aware of more than 
one idle processor. In such a situation, 
how should the available work be divided 
and still leave a significant amount for 
the sending processor? 


Further complicating this question is the fact that the overhead
involved in synchronizing the various processors and transmitting
problems to idle ones will eventually reach a point where it will be
more than the amount of work left to be done. An analogous situation
exists in sorting; fast versions of QUICKSORT eventually resort to a
simple sort when the amount remaining to be sorted is small [13]. 


In this case, it would appear that a 
point will eventually be reached where it 
is more effective for a processor simply 
to complete the problem itself rather 
than transmit parts of it to others. 
Determination of this point will depend 
on the depth in the tree of the problem 
to be solved and the amount of informa- 
tion that must be passed (which depends 
on the lookahead algorithm being used.) 


Processor Intercommunication 


One decision that has to be made is how the need to transfer work is
recognized. Specifically, does a processor which has no further work
interrupt a busy processor, or does a processor with extra work poll
its neighboring processors to see if they are idle? 


The advantage of interrupts is that as soon as a processor needs
work, it can notify another processor instead of waiting to be
polled. This assumes, however, that a processor would service the
interrupt immediately instead of waiting until it had finished its
current work. A disadvantage is that when a processor goes idle, it
cannot know which of its neighbors to interrupt. Using polling, an
idle processor can be sent work by any available neighboring
processor instead of being forced to choose and interrupt one. In
addition, although an interrupted processor may be working or
transmitting (a logical and necessary condition) when interrupted,
it may not have a problem to pass when it is time to pass work to
the interrupting processor. In fact, the interrupted processor could
itself go idle. For these reasons the simulation we discuss in
section IV uses polling. Whenever a processor completes a node in
the tree, and as long as it has work it could transfer, it checks
each neighboring CPU and the connecting bus. If both are idle, a
transfer is made. 
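
The polling rule just described can be sketched as follows. The classes `Processor` and `Bus` and the function `poll_and_transfer` are hypothetical names introduced for illustration; the assumptions that the sender always keeps at least one subproblem and hands over the first one in its list are ours, not the paper's.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    name: str
    work: list = field(default_factory=list)  # pending subproblems

    @property
    def idle(self):
        return not self.work


@dataclass
class Bus:
    busy: bool = False


def poll_and_transfer(sender, links):
    """Poll each (neighbor, bus) pair once; transfer when both are idle.

    Returns the names of the neighbors that received a subproblem.
    """
    served = []
    for neighbor, bus in links:
        if len(sender.work) < 2:   # assumed: keep one subproblem for itself
            break
        if neighbor.idle and not bus.busy:
            neighbor.work.append(sender.work.pop(0))
            served.append(neighbor.name)
    return served


sender = Processor("p0", ["a", "b", "c"])
links = [
    (Processor("p1"), Bus()),            # idle neighbor, free bus
    (Processor("p2", ["x"]), Bus()),     # busy neighbor
    (Processor("p3"), Bus(busy=True)),   # idle neighbor, busy bus
]
print(poll_and_transfer(sender, links))  # prints ['p1']
```

Only the pair in which both the neighbor and the connecting bus are idle results in a transfer.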
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problem sizes (small and medium) 


IV. SIMULATION EXPERIMENTS 

In order to better understand the behavior of the tightly coupled
asynchronous parallel computer, we have designed a series of
simulation experiments using the consistent labeling constraint
satisfaction problem. The simulation used to perform these
experiments was written in SIMULA [Birtwistle, Myhrhaug & Nygaard,
1973]. Let U and L be finite sets. Let $R \subseteq (U \times L)^2$.
We use the simulated parallel computer to find all functions
$f: U \to L$ satisfying that for all $(u,v) \in U \times U$,
$(u, f(u), v, f(v)) \in R$. The goal of the experiments is to
determine which architectural and which problem related factors are
significant enough to warrant further investigation. This paper
presents the results for problem related factors. 
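
To make the consistent labeling problem concrete, here is a small single-processor sketch that enumerates every consistent labeling by plain backtracking. It is an illustration only: it is not the forward-checking algorithm used in the experiments, and the function name `consistent_labelings` and the example relation are assumptions.

```python
def consistent_labelings(units, labels, R):
    """Yield every f (as a dict) with (u, f(u), v, f(v)) in R for all u, v."""
    def extend(assignment, remaining):
        if not remaining:
            yield dict(assignment)
            return
        u = remaining[0]
        for l in labels:
            # Check u against itself and against every unit labeled so far,
            # in both orders, as the definition quantifies over all of U x U.
            ok = (u, l, u, l) in R and all(
                (u, l, v, m) in R and (v, m, u, l) in R
                for v, m in assignment.items()
            )
            if ok:
                assignment[u] = l
                yield from extend(assignment, remaining[1:])
                del assignment[u]

    yield from extend({}, list(units))


# Example relation: adjacent units on the path a - b - c need different labels.
U = ["a", "b", "c"]
L = [0, 1]
adjacent = {("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")}
R = {
    (u, l, v, m)
    for u in U for l in L for v in U for m in L
    if not ((u, v) in adjacent and l == m)
}
for f in consistent_labelings(U, L, R):
    print(f)
```

The example has exactly two consistent labelings, the two alternating assignments along the path.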


In this experiment each problem factor was tested at two levels. The
factors and levels tested are given in Table 1. Based on previous
experiments [16], it was very clear that forward-checking was
significantly better than backtracking so all experiments used the
forward-checking algorithm [11]. In order that the results be
applicable for different architectures and problem sizes, two
problem sizes (small and medium) and two very different
architectures (in terms of the number of communication paths) were
used. The architectures chosen were symmetric to eliminate the need
for assumptions about the architecture related factors discussed
earlier. The ring architecture, B(64, 2, 64, 2), due to the limited
interconnection structure, will have difficulty passing work from
the initial processor to distant processors. The Boolean 6-cube,
B(64, 6, 192, 2), should be able to effectively utilize most of the
64 processors. Finally, one replication was run of each combination.
This involves running the simulation with different random number
seeds to create statistically equivalent combinatorial problems. An
analysis of variance was used to determine the significance of the
problem related parameters and to determine interactions of the
parameters [20]. The measure of performance used was the time until
the problem was solved. 


Results 

The analysis of variance was done using the SAS (Statistical
Analysis System) package. The analysis showed statistically
significant differences in the means (at a level of 0.0001), and
second and third order interactions for the search strategy, size
passed, and number passed. The means for the two cutoff point levels
were not statistically different. Because the three way interaction
among strategy, size, and number was significant, the combinations
of these three factors were treated as eight levels of one combined
factor for further analysis. 

Duncan’s multiple range test was performed [20] to divide the levels
into groups with similar performance. The results, based on the
average time to completion for the different experimental
conditions, are shown in Table 2. 


The key result is that one combination is 
clearly superior, depth-large-50%, and 
should be used in further experiments. 
(This combination also produced the low- 
est mean for each of the four architec- 
ture-problem size pairs.) 


There is a logical explanation for the groupings. For each factor
one value can be classified as positive (i.e., it should contribute
to improved performance regardless of other factors), and the other
negative (i.e., it should result in poorer performance). The
positive factors are indicated as level 1 in Table 1. For example,
passing more than one sub-problem or passing large sub-problems
should be preferable as the idle processor should stay busy longer.
Since in a depth first search a processor works on small problems,
this should leave larger problems to pass. As a result communication
time is reduced. 


Using this idea of a positive level for each factor, only one
combination has all 3 levels positive, three have two positive,
three have one positive, and one no positive levels. The grouping
produced by Duncan’s test confirms this analysis and, in fact,
produces a finer partition. Thus, the interaction between these
factors agrees with the analysis. The analysis of variance also
indicated significant interactions between the combined factor and
the experimental conditions of problem size and architecture. 
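
The 1-3-3-1 count and its agreement with the measured means can be reproduced directly from the rows of Table 2. The sketch below is an illustration, not part of the original analysis; it takes depth, large, and 50% as the positive levels (level 1 in Table 1) and reads the garbled "5.0%" entry in Table 2 as 50%.

```python
# (search, size, number, mean completion time), transcribed from Table 2
ROWS = [
    ("breadth", "small", "one", 2705274),
    ("breadth", "large", "one", 1874887),
    ("depth", "small", "one", 689372),
    ("breadth", "small", "50%", 451133),
    ("breadth", "large", "50%", 335267),
    ("depth", "large", "one", 301774),
    ("depth", "small", "50%", 247667),
    ("depth", "large", "50%", 147181),
]
POSITIVE = {"depth", "large", "50%"}


def positive_levels(row):
    """Number of the three factors set to their positive (level 1) value."""
    return sum(value in POSITIVE for value in row[:3])


def group_by_positive_levels(rows):
    """Map positive-level count -> sorted list of mean completion times."""
    groups = {}
    for row in rows:
        groups.setdefault(positive_levels(row), []).append(row[3])
    return {k: sorted(v) for k, v in groups.items()}


groups = group_by_positive_levels(ROWS)
for count in sorted(groups, reverse=True):
    print(count, groups[count])
```

Every combination with more positive levels has a lower mean completion time than every combination with fewer, which is the ordering the paragraph above describes.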


To best understand these interactions, the values were plotted as
suggested by Cox [6] (Figures 1, 2, 3). If there were no
interaction, then the curves in each figure would be parallel. 


Figure 1 shows a clear interaction between problem size and
architecture. For a small problem, a small number of processors is
sufficient; thus, the inability of the ring to spread sub-problems
to idle processors is not a severe handicap. However, for a larger
problem, the performance of the ring is much worse than that of the
6-cube which is able to involve many more of the processors. In each
case the time to completion was approximately 3 times longer in the
ring architecture. Since the degree of each processor node in the
ring is 1/3 of the degree of each processor node in the Boolean
6-cube, it appears that performance may be proportional to the
degree of the processor nodes. This has intuitive appeal because
more communication paths should improve the ability of processors to
keep busy. Later experiments will confirm or deny this conjecture.
It is also possible that diminishing returns may set in for
extremely large numbers of communication nodes. This plot indicates
that the use of an optimum architecture becomes more crucial for
large problems. 


Figure 2 shows the interaction of the combined problem solving
factor with problem size. Clearly, the need to determine the best
combinations of problem solving factors becomes more critical as the
size of the problem increases because a bad choice has a greater
detrimental effect on the larger problem. 


Figure 3 shows the interactions of the 
combined problem solving factor with 
architecture type. This plot shows that 
an optimum choice of problem-solving fac- 
tors tends to reduce the effects of a bad 
choice of architecture. However, the 
difference in performance between the 
architectures using the optimum problem 
solving strategy is still a factor of 3, 
so that further experiments to determine 
an optimum architecture seem justifiable. 
Table 1 - Experiment Summary 


FACTORS TESTED 

FACTOR                          LEVEL 1                      LEVEL 2
search strategy                 depth-first                  breadth-first
size of sub-problem passed      largest                      smallest
number of sub-problems passed   50% of expected total work   1 sub-problem
cutoff point                    none                         4 units to be tested


EXPERIMENTAL CONDITIONS 

Architecture                    Ring     Boolean 6-cube
number of processors            64       64
number of buses                 64       192

Size of combinatorial problem   small - 12 units & labels, random
                                medium - 16 units & labels, random

One replication. 

Table 2 - Duncan’s Multiple Range Test 


GROUPING*   MEAN COMPLETION TIME   ID NUMBER   SEARCH    SIZE    NUMBER
A           2,705,274              8           breadth   small   one
B           1,874,887              7           breadth   large   one
C             689,372              4           depth     small   one
D             451,133              6           breadth   small   50%
E             335,267              5           breadth   large   50%
FE            301,774              3           depth     large   one
F             247,667              2           depth     small   50%
G             147,181              1           depth     large   50%


*means with the same grouping are not significantly different 

Significance level = 0.05 


[FIGURE 1: PROBLEM SIZE AND ARCHITECTURE. Axes: problem size (# units = # labels) and completion time (in 100,000's).] 

[FIGURE 2: PROBLEM SIZE AND PROBLEM-SOLVING FACTORS VS. COMPLETION TIME. Axes: combined factor ID number and completion time (in 100,000's).] 

[FIGURE 3: ARCHITECTURE AND PROBLEM-SOLVING FACTORS VS. COMPLETION TIME.] 
NOVAC - A NON-TREE VARIABLE TREE FOR COMBINATORIAL COMPUTING 


B.C. Desai 
J. Opatrny 


C. Lam 
P. Grogono 


J.W. Atwood 
S. Cabilio 


Computer Science Department 
Concordia University 
Montreal, Quebec, Canada 


Abstract - - In exact computation, a number of 
problems exist, the solution to which demands an 
exhaustive search and hence a great deal of 
computing time. The algorithms used are simple 
but the computation involved is so great that it 
cannot be done economically on a large scale 
time-shared general purpose computer. The 
present multiprocessor project at Concordia 
consists of a dynamically variable virtual tree 
structured system for solving a class of combina- 
torial problems. The proposed multiprocessor 
structure consists of loosely coupled processors 
with no shared memory. Each processor in the 
system can be a master or a slave or both, and 
under certain conditions a master processor can 
become a slave of its own slave processor. A 
master assigns tasks to the slaves and subsequent- 
ly obtains results from them. The nature of the 
problem being solved and the high bandwidth of 
the interprocessor communication bus is expected 
to cause inappreciable degradation due to conten- 
tion. The user expresses the problem being 
solved in a high level language called Pascal-C; 
this is conventional Pascal with a number of 
additional constructs including synchronization 
statements. Another part of the project involves 
designing the extensions to an existing operating 
system to support this dynamically variable 
structure and the runtime system of Pascal-C. 


Introduction 


There are many problems in exact computation requiring a great
amount of computing time to solve them. The computations involved
are simple; however, the amount of computation is so great that it
cannot be done economically on a large scale general purpose
computer. Parallel processing of subproblems derived from a large
class of problems on multi-microprocessors is becoming increasingly
feasible economically. Interprocess communication of these
processors is implemented by an interconnection network. A number of
surveys of interconnection networks have appeared in relevant
literature, e.g. [7]. Numerous systems have been proposed to exploit
the parallelism in such problems. A computer structure in the form
of a tree has been proposed in [1]; a system to solve problems that
may be expressed with recursive algorithms is presented in [4]. In
[5], a microprocessor based system has been described for the 0-1
programming problem. A number of adaptive computer architecture
schemes have been proposed recently; [9,12] are examples of such
systems. However, many of these systems are in the developmental or
experimental stage and/or are too expensive for general
availability. 


The NOVAC project at Concordia consists of a loosely coupled,
non-tree structured multiprocessor system with the potential of
being dynamically structured into a virtual variable tree to solve a
class of combinatorial problems. This system is to be built with
off-the-shelf mini and micro computers, and interconnected using an
inexpensive asynchronous bus. The logical structure, consisting of a
hierarchy of masters, each with a number of slaves, is natural for
the set of problems which can be split up into a number of identical
sub-problems. 


Novac Hardware 


The hardware, Figure 1, consists of a number of PDP-11 based
processing systems; each processor in the system has its own private
memory and is under control of its own operating system. Each
processor executes independently and communication between
processors is via the common Novacbus. Each processor thus has the
same physical status as any other processor. 

The initial system consists of a PDP-11/34 and several LSI-11/23
processors. The PDP-11/34 is equipped with conventional peripherals
(terminal, printer, and disk drives) and a UNI (Unibus to Novacbus)
interface to the Novacbus, but the LSI-11/23 has only a UNI
interface, and as such can only communicate with the other
processors in the system. 


The proposed system has the following features: (i) the problem
presented to it can be solved by the same program code (a copy of
the code is resident in the memory of each processor); (ii) the
logical tree structure has a master-slave relationship of the
processors: the master assigns the subtasks to the slaves; the
slaves can act as masters and divide their tasks and assign them to
their slaves; (iii) there is a main master at the "root" of the tree
and the user communicates his problem via this master; (iv) there is
no shared memory and communication is limited between the master and
slave; (v) no communication exists amongst the slaves; (vi) the
amount of communication between the master and slave is not
extensive; (vii) the processors are interconnected via the UNI
interface to the Novacbus. 


Since the amount of communication between 
the processors is limited, a single asynchronous 
bus of high bandwidth to serve the maximum 
number of processors is proposed; the high band- 
width keeps the contention for use of this bus 
low. The proposed channel will support a 
hierarchy of communication protocols from high 
level virtual communication between programs, 
to low-level physical communication between 
hardware units. 


This system, which has only one interconnection per processor, has
the following drawbacks. In applications where interprocessor
communication is very high the system will degrade considerably.
However, in compute bound situations where the ratio of local
processing to interprocess communication requirements is high (i.e.
where for each word of interprocess communication, the number of
instructions executed is of the order of 10° to 10°) this
interconnection will allow a large number of processors to be
interconnected. Each processor in this system has its own copy of
the program code, which makes inefficient use of memory. In
addition, data must be transmitted from the master processor to the
slave processor instead of pointers. However, in the applications
considered for NOVAC, where the amount of communication is limited,
the transmission time for interprocessor communication is expected
to be low. 


The proposed interconnection is simple and 
inexpensive while providing for modularity. The 
communication protocol is simple to set up and 
control. 


Pascal-C 


The design of the language for the multi- 
processor system is based on the following 
objectives and assumptions: 


1. Programs for the system can be developed in 
a familiar high-level language which has been 
augmented with only a few new constructs; 


2. There should be specific language constructs 
to allow efficient and simple utilization of 
the processors in the system; 


3. Synchronization of processes is simple or 
even unnecessary in most computationally 
bound combinatorial problems. 


Since none of the well known languages for concurrent programming,
e.g. Concurrent Pascal [2], Modula [11], Ada [8], Edison [3],
satisfied our specific requirements, we decided to use Pascal-C,
which is Pascal augmented by these three additional constructs: down
procedures, critical procedures, and synchronization statements. The
syntax and the usage of these constructs are given below; the
details of these extensions are given in [10]. 


The procedure and function declaration part 
of a block in Pascal-C may include declarations 


of critical procedures and down procedures. 


<critical procedure declaration>::=critical 
<procedure declaration> 


<down procedure declaration> ::=down 
<procedure heading> 
<copy section> 
<block> 


<copy section> ::=copy<identifier>{ ,<identifier>} 
| <empty> 


However , 
Thus, in the declaration of a critical procedure the keyword
critical precedes the keyword procedure. In the declaration of a
down procedure the keyword down precedes the keyword procedure. 


Furthermore, the heading of a down procedure can be followed by the
keyword copy and a list of identifiers containing global variables,
procedures and functions which can be used in the statements of the
down procedure. 


The following synchronizing statements are 
available in Pascal-C: 


<wait statement> ::=wait (<identifier>{ ,<identifier>}) 


<terminate statement> ::=terminate (<identifier>{ ,<identifier>}) 


where <identifier> must be a name of a down 
procedure. 


In addition to the scope level as usually 
defined in Pascal, we will also define for each 
element of the language, (e.g. variable, func- 
tion, procedure, down procedure) its process 
level. This process level indicates the nesting 
level with respect to down procedures. The 
critical and down procedures cannot be recursive; 
in addition the critical procedure cannot be 
nested or call another critical procedure, and 
all its parameters must be value parameters. 


An invocation of a down procedure creates 
an independent concurrent process in a slave 
processor. A critical procedure call in a slave 
processor creates a new process in the master 
processor. Critical procedures are used by a 
slave to pass its results back to the master. 
Synchronization statements allow the process in 
a master to wait for the results of its slave(s), 
or to terminate processes in its slaves. Down 
procedure statements and critical procedure 
statements are used to invoke down procedures 
and critical procedures, respectively. The 
syntax of down procedure statements and critical 
procedure statements is identical to the syntax 
of ordinary PASCAL procedure statements. The 
actual parameter of a down procedure correspon- 
ding to a variable parameter must be a variable 
whose scope includes the down procedure. 
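
The interplay of down procedures, critical procedures, and the wait statement can be approximated in a conventional language. The following Python sketch is only an analogy under stated assumptions, not Pascal-C and not the NOVAC runtime: threads stand in for slave processors, `Master.report` plays the role of a critical procedure (executed atomically in the master), `count_subrange` plays the role of a down procedure body, and the standard-library `wait` plays the role of the wait statement. All names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


class Master:
    """Holds the master's state; slaves report back through report()."""

    def __init__(self):
        self.total = 0
        self._lock = threading.Lock()

    def report(self, value):
        # Stand-in for a critical procedure: runs atomically in the master.
        with self._lock:
            self.total += value


def count_subrange(master, lo, hi):
    # Stand-in for a down procedure body: solve one identical sub-problem
    # (here, count multiples of 3) and pass the result up to the master.
    master.report(sum(1 for k in range(lo, hi) if k % 3 == 0))


def run(n_slaves=4, limit=1200):
    master = Master()
    step = limit // n_slaves
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        futures = [
            pool.submit(count_subrange, master, i * step, (i + 1) * step)
            for i in range(n_slaves)
        ]
        wait(futures)  # stand-in for: wait(<down procedure names>)
    return master.total


print(run())  # prints 400
```

As in the Pascal-C design, the slaves never communicate with each other; each one only receives its parameters and reports a result to its master.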


The present design of Pascal-C does not 
allow for the use of pointers as variable para- 
meters to down procedures or in copy sections. 


Pascal-C has been used successfully to 
program and dry run several different kinds of 
combinatorial problems. 


Implementation of Novac 


Here are the three major areas in which 
implementation effort is underway: 


1. Assembly of the NOVAC-Tree hardware units. 
This will be from off-the-shelf systems 
with very little additional hardware to 
provide interprocessor communication. 


2. Construction of an operating system (OS) 
that provides the multiprocessing 
capabilities required for the NOVAC-Tree 
system and the Runtime system (RTS) of 
Pascal-C. The OS is based on RT-11 
which allows the processors connected to 
the NOVAC bus to appear as ordinary 
peripherals. The RT-11 OS will require 
additional features; for example, it 
must support the mapping between virtual 
and real slave processors; it must allow 
atomic execution of critical procedures. 


3. Construction of a compiler and runtime 
system for Pascal-C. The proposed 
language closely resembles standard 
Pascal, and it should be possible to 
adapt an existing Pascal compiler to 
the requirements of the project. Most 
of the support for non-standard features 
is provided by the RTS, and a substantial 
proportion of the language implementation 
effort will be directed towards the RTS. 


The progress of this project and experience gained from it will be
revealed in a future paper. 


RESULTS IN PARALLEL SEARCHING, MERGING, AND SORTING 
(Summary) 


Introduction 


Comparison problems such as merging and sorting are of fundamental
importance in computer science, and much effort has been devoted to
finding efficient algorithms for solving these problems on
sequential processors. Recently a similar effort has been devoted to
solving these problems on parallel processors [1] - [7]. In this
paper we present parallel algorithms for searching, merging, and
sorting which have good worst-case performances. 


Initially, our algorithms are presented under Valiant’s model of
parallel computation [7] which captures the inherent difficulty of
solving comparison problems (the notation has been changed slightly
to be consistent with ours): 


... there are P processors available, and therefore P comparisons
can be performed simultaneously. The processors are synchronized so
that within each time interval each of them completes a comparison.
At the end of the interval the algorithm decides, by inspecting the
ordering relationships that have already been established, which P
(not necessarily disjoint) pairs of elements are to be compared
during the next interval, and assigns processors to them. The
computation terminates when the relationships that have been
discovered are sufficient to specify the solution to the given
problem. 


Under this model Valiant presented algorithms for. 


merging and sorting that were faster than any 
previously known. 


More realistic models of parallel computation
are shared-memory machines. One particular model
of shared-memory machine is the CREW P-RAM
(Concurrent Read Exclusive Write Parallel Random
Access Machine), which in a single cycle allows
the processors to perform concurrent reads from
the same location but not concurrent writes.
Although shared-memory machines with single cycle
memory access are not physically constructible --
at least for very large numbers of processors --
the study of their performance can be a powerful
tool for gaining insight into the nature of
parallel computation. Borodin and Hopcroft [1]
showed that Valiant’s merging and sorting
algorithms can, in fact, be implemented on a
CREW P-RAM.


This work was supported in part by the National
Science Foundation under Grant No. NSF-MCS81-05896.


We improve on the results of Valiant. For
example, we present a merging algorithm that is
optimal up to a constant factor when merging two
lists of equal size, independent of the number of
processors; in particular with N processors it
merges two lists each of size N in
$1.893 \lg\lg N + 4$ comparison steps. We then use
the merging algorithm to obtain a sorting
algorithm which, in particular, sorts N values
with N processors in
$1.893 \lg N \lg\lg N / \lg\lg\lg N$ (plus lower
order terms) comparison steps. All of our
algorithms can be implemented on a CREW P-RAM.


We define $\mathrm{Search}_P(N)$ to be the number of
comparison steps required by P processors to search
a sorted list of N elements for some specified
value, $\mathrm{Merge}_P(M,N)$ to be the number of
comparison steps required by P processors to merge
two sorted lists of sizes M and N, and
$\mathrm{Sort}_P(N)$ to be the number of comparison
steps required by P processors to sort N elements.
Throughout this paper, we write $\lg x$ for
$\log_2 x$ and $\ln x$ for the natural logarithm
$\log_e x$.


Searching and Merging 


This section contains results for parallel 
searching and merging. The general outline of 
the section closely follows Valiant [7] (Section 
2), and several of the algorithms are improve- 
ments of Valiant’s. 


The following theorem generalizes the sequen- 
tial algorithm for searching a sorted list to the 
parallel case. 


Theorem 1.
$\mathrm{Search}_P(N) \le \left\lceil \frac{\log(N+1)}{\log(P+1)} \right\rceil$.
Furthermore, the bound is tight.


Proof. We show by induction that in k comparison
steps we can search a sorted list of size
$(P+1)^k - 1$. The formula certainly holds for k=0.
Assume it holds for k-1. Then to search a list
of size $(P+1)^k - 1$, we can compare the element
being searched for to the elements in the sorted
list subscripted by $i \cdot (P+1)^{k-1}$ for
$i = 1, 2, \ldots, P$. There are no more than P
such elements [since
$(P+1)(P+1)^{k-1} = (P+1)^k > (P+1)^k - 1$]. Thus the
comparisons can be performed in one step, and the
problem is reduced to searching a list of size
$(P+1)^{k-1} - 1$. In general to search a list of
size N in k comparison steps we need

$$(P+1)^k - 1 \ge N \quad\text{or}\quad k \ge \frac{\log(N+1)}{\log(P+1)},$$

that is, $k = \left\lceil \frac{\log(N+1)}{\log(P+1)} \right\rceil$ steps suffice.


We now show that the bound is tight. Given a
sorted list of N elements, during the first step
any algorithm can examine only P elements. Some
segment of unexamined elements must have length
at least

$$\left\lceil \frac{N-P}{P+1} \right\rceil \ge \frac{N-P}{P+1} = \frac{N+1}{P+1} - 1.$$

By induction after the k-th step the problem must
have size at least $\frac{N+1}{(P+1)^k} - 1$. Thus
the number of steps required by any algorithm is
at least the minimum k for which

$$\frac{N+1}{(P+1)^k} - 1 \le 0 \quad\text{or}\quad k \ge \frac{\log(N+1)}{\log(P+1)},$$

that is, at least $\left\lceil \frac{\log(N+1)}{\log(P+1)} \right\rceil$ steps.
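
The inductive construction in the proof above can be sketched as a sequential simulation. This is a minimal illustration, not the authors' implementation: the name `parallel_search` and the probe spacing `s = ceil((n+1)/(P+1))` are assumptions chosen so that every unexamined segment has at most `s - 1` elements, and each pass of the `while` loop stands for one comparison step in which at most P probes are made at once.

```python
def parallel_search(items, target, p):
    """Locate target in the sorted list `items` with p comparisons per step.

    Returns (index, steps): index is the number of elements smaller than
    target, and steps is the number of simulated comparison steps.
    """
    lo, hi = 0, len(items)          # the answer always lies in [lo, hi]
    steps = 0
    while lo < hi:
        n = hi - lo
        s = -(-(n + 1) // (p + 1))  # ceil((n + 1) / (p + 1)), assumed spacing
        probes = range(lo + s - 1, hi, s)   # at most p probe positions
        steps += 1                  # all probes happen in one parallel step
        new_lo, new_hi = lo, hi
        for idx in probes:
            if items[idx] < target:
                new_lo = idx + 1    # answer is to the right of this probe
            else:
                new_hi = idx        # answer is at or left of this probe
                break
        lo, hi = new_lo, new_hi
    return lo, steps


def step_bound(n, p):
    """Smallest k with (p + 1)**k >= n + 1, i.e. ceil(log(n+1)/log(p+1))."""
    k = 0
    while (p + 1) ** k < n + 1:
        k += 1
    return k
```

The integer loop in `step_bound` avoids floating-point logarithms, so the comparison against the bound is exact.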


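Corollary 1.
$\mathrm{Merge}_P(1,N) \le \left\lceil \frac{\log(N+1)}{\log(P+1)} \right\rceil$.
Furthermore, the bound is tight.


Corollary 2. For $1 \le M \le P$ and $M \le N$,

$$\mathrm{Merge}_P(M,N) \le \left\lceil \frac{\log(N+1)}{\log(\lfloor P/M \rfloor + 1)} \right\rceil.$$

Proof. Assign $\lfloor P/M \rfloor$ processors to each element
in the smaller list and merge as in Corollary 1.


Corollary 3. For $1 \le P \le M \le N$,

$$\mathrm{Merge}_P(M,N) \le \lceil M/P \rceil \, \lceil \lg(N+1) \rceil.$$

Proof. Assign $\lceil M/P \rceil$ elements in the smaller
list to each processor and merge as in Corollary 1.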


Theorem 2. For $P = \lfloor M^{1-1/k} N^{1/k} \rfloor$, integer
$k \ge 2$, and $2 \le M \le N$,

$$\mathrm{Merge}_P(M,N) \le k \left\lceil \frac{\lg\lg M}{\lg k} \right\rceil + 1 \le \frac{k}{\lg k} \lg\lg M + k + 1.$$

Summary of proof. The theorem is proved inductively,
by showing, given $\lfloor M^{1-1/k} N^{1/k} \rfloor$
processors, how we can in k comparison steps reduce the
problem of merging two lists of length M and N to
the problem of merging a number of pairs of
lists, where each pair’s shorter list has length
less than $M^{1/k}$. The pairs of lists are
created so that we can distribute the
$\lfloor M^{1-1/k} N^{1/k} \rfloor$
processors amongst them in such a way as to
ensure that for each pair there will be enough
processors allocated to satisfy the induction
hypothesis.


The above theorem is a generalization of 
Valiant’s Theorem 3. Valiant’s result is the 
special case of k=2, whereas the formula is 


minimized when k=3: 


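Corollary 4. For $P = \lfloor M^{2/3} N^{1/3} \rfloor$ and $2 \le M \le N$,

$$\mathrm{Merge}_P(M,N) \le 3 \left\lceil \frac{\lg\lg M}{\lg 3} \right\rceil + 1 \le \frac{3}{\lg 3} \lg\lg M + 4 \approx 1.893 \lg\lg M + 4.$$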
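
The claim that k=3 minimizes the leading constant can be checked numerically. The short sketch below only evaluates the factor k / lg k for small integers k; the function name `merge_constant` is a hypothetical label introduced here, not something from the paper.

```python
import math


def merge_constant(k):
    """Leading constant k / lg k multiplying lg lg M in the merging bound."""
    return k / math.log2(k)


# Tabulate the constant for a few integer values of k >= 2.
for k in range(2, 7):
    print(k, round(merge_constant(k), 3))

best_k = min(range(2, 50), key=merge_constant)
```

Among integers the factor is 2 at k=2 and k=4 and dips below 2 only at k=3, which is where the constant 1.893 in the corollary comes from.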


The following corollary is very similar to 
Valiant’s Corollary 5. However, besides having 
the obviously better constant factor for k=3, the 
algorithm is slightly more natural and has a 
smaller additive constant. The proof is very 
similar to the proof of Theorem 2. 


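Corollary 5. For $P = \lfloor r M^{1-1/k} N^{1/k} \rfloor$ and
$2 \le r \le M \le N$,

$$\mathrm{Merge}_P(M,N) \le \frac{k}{\lg k} (\lg\lg M - \lg\lg r) + k + 1.$$


Corollary 6. For $2 \le P \le M \le N$,

$$\mathrm{Merge}_P(M,N) \le \frac{M+N}{P} + \lg\frac{M+N}{P} + \frac{3}{\lg 3} \lg\lg P + 6.$$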


Corollaries 5 and 6 together define a merging 
algorithm which, for M=N and all P, is optimal 
up to a constant factor; this optimality is a 
consequence of the lower bound for merging given 
in [1] and the fact that no parallel algorithm 
can be more than P times faster than its sequen- 
tial counterpart. 


Basically, all of these algorithms can be
implemented on a CREW P-RAM in time equal in
order to their number of comparison steps -- see
[1].

Sorting 


The merging algorithm of Corollary 6 allows us 
to obtain fast sorting algorithms by using an 
idea of Preparata [5]. In general, our sorting 
algorithms are enumeration sorts, i.e. the rank 
of an element is determined by counting the 
number of smaller elements. 


We present here the general idea of the sorting
algorithm under the simplifying assumption
that all variables are continuous. Let G be some
constant (dependent on P and N). The algorithm
works as follows:


If N=1 the list is sorted, while if P=1 sort in
$\mathrm{Sort}_1(N)$ comparison steps using the best
sequential sorting algorithm [it is well known that
$\mathrm{Sort}_1(N) = N \lg N - O(N)$]. Otherwise apply the
following procedure:

(1) Split the processors into G groups with
$P/G \ge 1$ processors in each, and split the
elements into G groups with $N/G \ge 1$ elements
in each.

(2) Recursively sort each group independently in 
parallel. 

(3) Merge every sorted list with every other 
sorted list. 

(4) Sort the entire list by taking the rank of 
each element to be the sum of its ranks in 
each merged list it appears in, less G-2 
times its rank in its own list. 
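
The rank arithmetic in steps (1) to (4) can be sketched sequentially as follows. This is only an illustration under stated assumptions: the values are distinct, ranks are counted from zero (the number of smaller elements), the built-in `sorted` stands in for the recursive parallel sort, and `heapq.merge` stands in for the parallel merge; the name `enumeration_sort` is hypothetical.

```python
from heapq import merge
from itertools import combinations


def enumeration_sort(values, g):
    """Sort distinct values by the group-merge-and-rank scheme of steps (1)-(4)."""
    # (1) split into g groups; (2) sort each group (stand-in for recursion)
    groups = [sorted(values[i::g]) for i in range(g)]
    own_rank = {v: r for grp in groups for r, v in enumerate(grp)}

    # (3) merge every sorted group with every other sorted group,
    #     accumulating each element's rank in every merged list it appears in
    rank_sum = {v: 0 for v in values}
    for a, b in combinations(range(g), 2):
        for r, v in enumerate(merge(groups[a], groups[b])):
            rank_sum[v] += r

    # (4) final rank = sum of merged ranks less (g - 2) times the own rank
    out = [None] * len(values)
    for v in values:
        out[rank_sum[v] - (g - 2) * own_rank[v]] = v
    return out
```

Each element appears in g-1 merged lists, and its own-group rank is counted once in each of them, so subtracting g-2 copies leaves exactly one copy plus the counts of smaller elements in the other groups.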

Noting that step (3) requires $\binom{G}{2} = \frac{G(G-1)}{2}$
independent merges, each of which can be given
$\frac{2P}{G(G-1)}$ processors, we are led to the following
recurrence relation for the time $S_P(N)$ it takes
this algorithm to sort N elements with P>1 processors.

$$S_P(N) = S_{P/G}(N/G) + \mathrm{Merge}_{2P/(G(G-1))}(N/G, N/G)$$

$$\le S_{P/G}(N/G) + \frac{N(G-1)}{P} + \lg\frac{N(G-1)}{P} + \frac{3}{\lg 3} \lg\lg\frac{2P}{G(G-1)} + O(1)$$


Let M be the minimum of P and N. Then not counting
the sequential sorting (after the final
recursive call), the above algorithm requires
approximately

$$\frac{\lg M}{\lg G} \left( \frac{N(G-1)}{P} + \lg\frac{N(G-1)}{P} + \frac{3}{\lg 3} \lg\lg P \right)$$

comparisons. This is minimized for

as 3. PigigP - 
& #-mes( 53 nige 22 2 = ae 


In [4] we show that this algorithm can be made 
rigorous for P, N, and G all integers. Further- 
more, we show that these algorithms can be imple- 
mented on a CREW P-RAM. This yields the follow- 
ing specific results. 


lgigN 


When N=P, G = = Ig lelgN 


so 


by equation (*), 
$$\mathrm{Sort}_N(N) \le \frac{1.893 \lg N \lg\lg N}{\lg\lg\lg N} \left( 1 + O\!\left( \frac{\lg\lg\lg\lg N}{\lg\lg\lg N} \right) \right).$$

This is an improvement on the $2 \lg N \lg\lg N$
obtained by Valiant.


For G=2, which is the optimal choice for G 
when N > 71nd P lglgP, 
N1gN a 
Sort, (N) < > + Ted lgP lglgP 
N 2N 
20 Ge: ASP Le) 


Note that for G=2 the algorithm is a pure
comparison sort, i.e. it is no longer an
enumeration sort.


Finally, if $P = N (\lg N)^{1/k}$ then

$$\mathrm{Sort}_P(N) \le 1.893\, k \lg N + o(k \lg N).$$

This is an improvement on Hirschberg [3], which
showed that $N^{1+1/k}$ processors can sort in
$O(k \lg N)$ time, and a generalization of Preparata
[5], which showed that $N \lg N$ processors can sort
in $O(\lg N)$ time.
ON COMPUTING WEAK TRANSITIVE CLOSURE
IN O(LOG N) EXPECTED RANDOM PARALLEL TIME

Abstract -- We consider the time needed to
compute the weak transitive closure of a Boolean
matrix, with respect to a probabilistic model of
unrestricted parallelism. Our principal result is
an algorithm that computes the weak transitive
closure of any n x n Boolean matrix in expected
time O(log n) and in time $O((\log n)^2)$ in the
worst case. Thus, the weakly connected components
of any directed graph on n nodes, or the connected
components of any undirected graph on n nodes, can
be computed within these bounds.


1. INTRODUCTION 


We define weak transitive closure in terms of
reflexive and transitive closure. The reflexive
and transitive closure of an n x n Boolean matrix
A is the Boolean matrix $A^* = (I \vee A)^{n-1}$, where
$\vee$ denotes coordinate-wise disjunction. The weak
transitive closure of A is the Boolean matrix
$\bar{A} = (A \vee A^T)^*$, where $A^T$ denotes the
transpose of A. We can regard A as the adjacency
matrix of a directed graph and $\bar{A}$ as a
presentation of the graph’s weakly connected
components. Our interest is in the parallel time
needed to compute weak transitive closure.


This material is based upon work supported by
the National Science Foundation under Grants
MCS80-03337 and MCS81-16678, and the Office of
Naval Research under Contracts N00014-80-C-0221
and N00014-82-K-0154.
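
The two closures defined in this introduction can be computed directly on small matrices. The following sequential sketch is for illustration only and is unrelated to the parallel algorithm of the paper: the function names are hypothetical, matrices are lists of lists of 0/1 values, and the power of (I v A) is obtained by repeated squaring, which is valid because powers beyond n-1 no longer change the matrix.

```python
def bool_mul(a, b):
    """Boolean matrix product: entry (i, j) is OR over k of a[i][k] AND b[k][j]."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def closure(a):
    """Reflexive and transitive closure: (I v A) raised to a power >= n - 1."""
    n = len(a)
    m = [[i == j or bool(a[i][j]) for j in range(n)] for i in range(n)]
    power = 1
    while power < n - 1:
        m = bool_mul(m, m)   # squaring doubles the path length covered
        power *= 2
    return m


def weak_closure(a):
    """Weak transitive closure: the closure of A v A^T."""
    n = len(a)
    sym = [[bool(a[i][j] or a[j][i]) for j in range(n)] for i in range(n)]
    return closure(sym)
```

For a digraph with edges 0 -> 1 and 2 -> 1, node 0 cannot reach node 2 in the ordinary closure, but the weak closure places 0, 1, and 2 in one weakly connected component.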


We use a probabilistic version of the
Parallel Random Access Machine (P-RAM), a model of
a synchronous parallel computer introduced by
Fortune and Wyllie [3]. An advantage of the model
is that its power can be related to the power of a
probabilistic Turing machine. A formalization of
this relation and a recent result of Aleliunas et
al. [1] lead to an algorithm that, given any
symmetric n x n Boolean matrix A, computes $A^*$
with probability of error $\le 1/2$, in O(log n)
time. We combine this algorithm with a
deterministic algorithm for reflexive and
transitive closure to obtain our main result: an
algorithm that, given any n x n Boolean matrix A,


- computes $\bar{A}$ with no chance of error,


- runs in expected time O(log n), worst
case time $O((\log n)^2)$, and


- uses $n^{O(1)}$ processors.


A more refined analysis than that given below 


indicates that O(n? 1ég n) processors suffice. 


Obviously, if $A^* = \bar{A}$, then our algorithm
computes reflexive and transitive closure.
Equality holds, for instance, if A is "Eulerian".
An n x n Boolean matrix is Eulerian if the number
of ones in its i-th row equals the number of ones
in its i-th column ($1 \le i \le n$). In this case,
we can regard A as the adjacency matrix of a
directed graph in which each strongly connected
component is an Eulerian digraph [2]. The
observation that $\bar{A} = A^*$ if A is a symmetric
Boolean matrix has bearing on the following
problem posed by Hirschberg et al. [5].


Let A be an n x n symmetric Boolean matrix.
We can regard A as the adjacency matrix of an
undirected graph G on nodes 1, 2, ..., n. As
defined in [5], the problem of computing the
connected components of G is to compute an
n-vector c, where c[i] = min{j: $1 \le j \le n$, and
j belongs to the same connected component as i in
G}, for $1 \le i \le n$. Hirschberg et al. presented
a deterministic algorithm, with respect to a
parallel model similar to ours, that computes c in
$O((\log n)^2)$ time in the worst case, using
$O(n^2/\log n)$ processors.


Notice that c can be obtained from $A^* = (a^*_{i,j})$
by computing c[i] = min{j: $1 \le j \le n$, $a^*_{i,j} = 1$}
for each i ($1 \le i \le n$). The min computations
can be carried out in O(log n) time using $n^2$
processors. As $\bar{A} = A^*$, our results provide an
alternative method for computing c that costs
O(log n) expected time, $O((\log n)^2)$ worst case
time, and $n^{O(1)}$ processors.
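
The step from the closure matrix to the component vector c can be sketched in a few lines. This is a sequential illustration only; the name `component_labels` is hypothetical, the input is assumed to be a reflexive closure matrix given as lists of 0/1 values, and nodes are numbered from 1 as in the text.

```python
def component_labels(closure):
    """c[i] = smallest node j (1-based) whose closure entry in row i is set."""
    n = len(closure)
    return [min(j + 1 for j in range(n) if closure[i][j]) for i in range(n)]


# Closure of an undirected graph with components {1, 2, 3} and {4}.
example = [[1, 1, 1, 0],
           [1, 1, 1, 0],
           [1, 1, 1, 0],
           [0, 0, 0, 1]]
labels = component_labels(example)
```

Because the closure is reflexive, every row has at least one set entry, so the minimum always exists.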


Some discussion is in order on how our
results contrast with recent (independent) results
of Reif [7]. Let us review a crucial result of
Aleliunas et al. first. In [1], Aleliunas et
al. gave an O(log n) space-bounded, $n^{O(1)}$
time-bounded sequential algorithm to decide
undirected graph reachability; that is, to decide
if two distinguished nodes in an n node undirected
graph belong to the same connected component. The
Aleliunas algorithm is probabilistic, with
"one-sided" error probability: If the two nodes
belong to the same component, then the algorithm
decides correctly with probability $\ge 1/2$;
otherwise, the algorithm always decides correctly.


The chance of error can be eliminated, without
change in the space or time bounds, if we allow
the algorithm to be non-uniform in n (Cf. [7]).


Reif’s work adapts the Aleliunas algorithm to
run in O(log n) parallel time, using $n^{O(1)}$
processors, with one-sided error probability [7].
(The result of Theorem 2 below is similar.) Again,
at the cost of a non-uniform construction, the
chance of error can be eliminated, without change
in the complexity bounds. Although our work is
also based on the Aleliunas algorithm, we get a
different result: graph reachability can be
decided for every pair of nodes in an undirected
graph, without error and without resort to a
non-uniform construction, in O(log n) time on
average, using $n^{O(1)}$ processors.


Reif defined a complexity class,
$\Sigma_*\mathrm{CSYMLOG}$, in terms of O(log n) space-bounded
symmetric Turing machines [6], and established
that every language in $\Sigma_*\mathrm{CSYMLOG}$ is
recognizable in O(log n) time, using $n^{O(1)}$
processors, with probability of error < p, where
p is any constant, 0 < p < 1. (Again, a
non-uniform construction can be used to eliminate
the chance of error.) His analysis can be
strengthened using our main result, to establish
that every language in $\Sigma_*\mathrm{CSYMLOG}$ is
recognizable, with no chance of error, in
O(log n) expected time, using $n^{O(1)}$ processors.
Reif showed that $\Sigma_*\mathrm{CSYMLOG}$ contains a number of
interesting problems: invalidity testing of
formulas in a restricted quantified Boolean logic,
recognition of edges in a minimum spanning forest,
recognition of k-connected vertex pairs in an
undirected graph, and several graph recognition
problems.


2. PROBABILISTIC PARALLEL RANDOM ACCESS MACHINES 


We begin with an informal description of a
deterministic P-RAM [3]. An infinite number of
processors $P_0, P_1, \ldots$ are available, with each
having a local accumulator, a local program
counter, and an infinite local memory. The
processors share access to two infinite global
register sets: the read-only input registers
$In_1, In_2, \ldots$ and the work registers
$W_1, W_2, \ldots$. A single program controls execution.
In this setting, a program is a finite list of possibly
labeled instructions. An instruction is of one of
the following forms.


    AC := x
    x := AC
    AC := AC + x
    AC := AC - x
    goto L
    if AC=0 then goto L
    fork L
    HALT


AC refers to the accumulator of the processor that 


executes the instruction, x is an address, an 


indirect address, or a constant, and L is a label. 


One restriction applies to the use of global 


memory. At each step, at most one processor can 


write to a register. On the other hand, any 


number of processors can read the same register 


simultaneously. 


Initially, a single processor $P_0$ is active.
If at some step t, a processor $P_i$ executes an
instruction of the form "fork L", then at step t+1
a new processor $P_j$ begins execution at the
instruction labeled L, with the accumulator of $P_j$
set to the value in the accumulator of $P_i$ at step
t. Hence in t steps up to $2^t$ processors can be
activated. By convention, for i < j, if $P_i$ is
activated at step $t_1$ and $P_j$ at $t_2$, then
$t_1 \le t_2$. The P-RAM terminates when $P_0$ executes
a HALT instruction.

As defined in [3], a deterministic P-RAM
computes a function from $\{0,1\}^*$ to $\{0,1\}$. We make
a small change to allow for the computation of
functions from $\{0,1\}^*$ to $\{0,1\}^*$. We add an
infinite global set of write-only output registers
$Out_1, Out_2, \ldots$, and fix input/output conventions
as follows. An input $x = x_1 x_2 \cdots x_s$ of s bits is
presented one bit per register in
$In_1, In_2, \ldots, In_s$, and s is presented in
processor $P_0$'s accumulator.
All other registers have value 0. We say the
machine outputs $y = y_1 y_2 \cdots y_t$ of t bits, iff
when $P_0$ halts


1. t = max{a,0} where a is the number in
$P_0$'s accumulator, and


2. $y_i = 0$ iff $Out_i = 0$, for $1 \le i \le t$.


In a deterministic P-RAM, no two instructions
can have the same label. In a probabilistic
P-RAM, a given label can be associated with at
most two instructions. If two instructions $I_1$ and
$I_2$ have the same label L and a processor $P_i$
executes a jump to L, then with probability 1/2 $P_i$
jumps to $I_1$; otherwise (with probability 1/2) $P_i$
jumps to $I_2$ (independently of any other step of $P_i$
or any other processor).


A probabilistic P-RAM computes a random
function [8], defined as follows. A random
function F from X to Y is characterized by a
function $p_F : X \times Y \to [0,1]$ such that, for all x
in X, $\sum_{y \in Y} p_F(x,y) \le 1$. The expression F(x)=y
means F applied to x equals y with probability
$p_F(x,y)$. A probabilistic P-RAM Q is said to
compute F iff $X = Y = \{0,1\}^*$ and on input x, Q outputs
y with probability $p_F(x,y)$, for all x in X and y
in Y. Also, Q does not terminate with probability
$1 - \sum_{y \in Y} p_F(x,y)$, for all x in X. We say Q runs
in expected time T(n) (in time T(n)) iff, on every
input of n bits, Q terminates in expected time
$\le T(n)$ (Q always terminates in $\le T(n)$ time).


A probabilistic Turing machine also computes
a random function from $\{0,1\}^*$ to $\{0,1\}^*$ [4, 8].
Fortune and Wyllie have shown that with respect to
computing 0-1 functions, time on a deterministic
P-RAM is at least as powerful as space on a
deterministic Turing machine. A small
modification of their proof gives us a useful
connection between probabilistic Turing machines
and probabilistic P-RAM’s.


Lemma 1: Let F be any random function
computable on an S(n) space-bounded, T(n)
time-bounded multitape probabilistic Turing
machine, where log T(n) = O(S(n)) and
$S(n) \ge \log n$. Then there is a probabilistic P-RAM
that computes F in time O(S(n)).


Proof: (Outline)


Consider a probabilistic Turing machine M
that computes F in S(n) space and T(n) time. We
may assume that M never enters the same
configuration twice, as without more than a
constant factor loss in time or space M can
maintain a "clock" on a work tape. Fix an input x
of n bits. M can assume up to $2^{d \cdot S(n)}$
configurations on x, where d is a constant.


We adapt a technique of [3] to construct a
probabilistic P-RAM P that simulates M on x so as
to compute F(x) in O(S(n)) time. Suppose that the
function S(n) is itself computable in O(S(n))
time. P forks $2^{d \cdot S(n)}$ processors, giving a
processor $P_i$ for each configuration $c_i$ of M on x.
$P_i$ chooses a successor $c_j$ of $c_i$ uniformly at
random from among the possible successors of $c_i$
and writes j into global register $W_i$, for all i
($1 \le i \le 2^{d \cdot S(n)}$). In effect, this constructs a
graph in which a given node i corresponds to
configuration $c_i$. At most one edge emanates out
of node i, to node j, where j is the value of $W_i$.


The graph contains a directed path leading
out of the node corresponding to the start
configuration, which describes an execution of M
on input x. P uses global memory to mark each
node on this path. As M runs in S(n) space, the
last node on the path must correspond to a final
configuration. Lastly, the output registers are
written in accordance with the marked nodes. The
essential details involved in implementing these
steps in O(S(n)) time can be found in [3].


To carry the proof through without assuming
that S(n) is computable in O(S(n)) time, we use
the usual trick of trying the simulation for
S(n) = 2, 4, 8, ... until we reach a trial value
large enough so that M reaches a final state in
the simulation. In performing each trial, if any
successor of $c_i$ requires more than the currently
allowed amount of space, we leave $W_i$ undefined,
and if $W_i$ is defined from a previous trial, we do
not change it. The latter condition is necessary
to maintain the right probabilistic behavior.
Otherwise the simulation would be biased towards
choices that lead to rapid halting.


3. THE ALGORITHM FOR WEAK TRANSITIVE CLOSURE 


We begin with a theorem describing a random
function which we refer to as random closure. For
any symmetric Boolean matrix A, the random closure
of A equals the reflexive and transitive closure
of A with probability $\ge 1/2$. We will exploit
this and other properties of random closure to
arrive at an efficient algorithm for weak
transitive closure.

One more notation is of use. We write $A \le B$
to signify that the Boolean matrices A and B
satisfy $B = B \vee A$.


Theorem 2: There is a probabilistic P-RAM P
that computes a random function R from symmetric
Boolean matrices to Boolean matrices such that for
every n x n symmetric Boolean matrix A:


1. P computes R(A) in time O(log n).


2. If $p_R(A,B) \ne 0$, then B is an n x n
Boolean matrix with $A \le B \le A^*$.


3. $p_R(A,A^*) \ge 1/2$.

We refer to R(A) as the random closure of A.
The second property implies that $B^* = A^*$.
The third indicates that R(A) is likely to equal
$A^*$.

Proof: By Lemma 1, it suffices to give an
O(log n) space-bounded, $n^{O(1)}$ time-bounded
probabilistic Turing machine that computes a
random function satisfying the second and third
properties. We appeal to a recent result
involving random walks in graphs, defined as
follows. A random walk in an undirected graph G
starts at an arbitrary node in G. At each step
beginning at a node v, we choose an edge uniformly
at random from the edges emanating out of v and
traverse it. Now let G be an undirected graph
having e edges. Also let i and j be any two nodes
in G, and let $d_{ij}$ be the length of the shortest
path from i to j. The analysis of Aleliunas et
al. shows that the expected number of steps in a
random walk in G starting at i before reaching j
is at most $2 d_{ij} e$ [1].


Let A be any n x n symmetric Boolean matrix
and G the graph (having n nodes and e edges)
corresponding to A. Using the methods in [1], we
can construct an O(log n) space-bounded, $n^{O(1)}$
time-bounded probabilistic Turing machine M which
on input i,j,A simulates step by step a random
walk of 4(n-1)e steps starting at node i in G
($1 \le i,j \le n$). If the walk reaches j then M
outputs 1; otherwise M outputs 0. If i and j are
in the same connected component of G, then
$d_{ij} \le n-1$, so the expected number of steps
needed to reach j is at most 2(n-1)e, and by
Markov's inequality the walk reaches j within
4(n-1)e steps with probability $\ge 1/2$. Hence M
outputs 1 with probability $\ge 1/2$. M always
outputs 0 otherwise.
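
A single run of the machine M described above can be sketched as an ordinary random walk. The code below is an illustration under assumptions, not the construction of [1]: the graph is given as adjacency lists, the function name `walk_reaches` is hypothetical, and a seeded `random.Random` object is passed in so that runs are repeatable.

```python
import random


def walk_reaches(adj, i, j, rng):
    """Random walk of 4(n-1)e steps from node i; True if node j is visited.

    adj[v] lists the neighbours of v in an undirected graph. The error is
    one-sided: True is returned only if j is actually reachable from i.
    """
    n = len(adj)
    e = sum(len(nbrs) for nbrs in adj) // 2   # each edge is listed twice
    v = i
    for _ in range(4 * (n - 1) * e):
        if v == j or not adj[v]:
            break                             # arrived, or stuck at an isolated node
        v = rng.choice(adj[v])                # uniform choice among incident edges
    return v == j


# Path 0 - 1 - 2 plus an isolated node 3.
graph = [[1], [0, 2], [1], []]
rng = random.Random(0)
found = any(walk_reaches(graph, 0, 2, rng) for _ in range(50))
```

Repeating the walk and taking the disjunction of the outcomes drives the failure probability down geometrically, which is the amplification used in the remainder of the proof.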


To complete the proof, for $1 \le i,j \le n$,
compute $b_{i,j}$ as the disjunction of $a_{i,j}$ and the
$\lceil \log_2(2n^2) \rceil$ results of running M
$\lceil \log_2(2n^2) \rceil$ times, on input i,j,A.
The computation requires O(log n) space and
$n^{O(1)}$ time. $A \le B$ follows by
construction. Also, $B \le A^*$ holds, since $b_{i,j} = 1$ is
possible only if j is reachable from i in the
graph corresponding to A. Finally, it can be
verified that $B = A^*$ with probability $\ge$

$$\left(1 - (1/2)^{\lceil \log_2(2n^2) \rceil}\right)^{n^2} \ge \left(1 - \frac{1}{2n^2}\right)^{n^2} \ge 1/2.$$


We note that a similar argument can be used to 


obtain a random function that maps Eulerian 


matrices to Boolean matrices and satisfies the 


three properties given in the theorem. 


Now, we give our main result. 


Theorem 3: There is a probabilistic P-RAM
that, on every n x n Boolean matrix A, computes $\bar{A}$
without error, runs in expected time O(log n), in
time $O((\log n)^2)$ in the worst case, and uses
$n^{O(1)}$ processors.


Proof: The desired P-RAM executes the
following algorithm. R denotes random closure
(Cf. Theorem 2).


    S := I v (A v A^T);
    repeat
        T := R(S);
        S := (T v T^T)^2
    until S = T.

At the bottom of the repeat loop we have
$I \vee (A \vee A^T) \le T \le (A \vee A^T)^* = \bar{A}$. When
the algorithm halts $T = (T \vee T^T)^2$, so it follows
that $T = \bar{A}$ holds. Each iteration of the loop
costs O(log n) time using $n^{O(1)}$ processors.
Because of the squaring, the loop will be repeated
at most O(log n) times, giving the desired worst
case time bound.


To complete the proof it suffices to show
that the expected number of iterations of the loop
is O(1). By Theorem 2, we have a sequence of
trials, each with probability of success $\ge 1/2$,
ending on the first success. The expected length
of such a sequence is $\le 2$.
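
The repeat loop in the proof can be simulated sequentially. This sketch is not the P-RAM program itself: the random closure R is modelled, as an assumption, by repeated random walks with the step and trial counts taken from the proof of Theorem 2, all function names are hypothetical, and matrices are lists of lists.

```python
import math
import random


def bool_square(m):
    """Boolean square of a matrix."""
    n = len(m)
    return [[any(m[i][k] and m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def random_closure(s, rng):
    """Model of R(S) for a symmetric matrix S: s[i][j] OR repeated random walks."""
    n = len(s)
    adj = [[j for j in range(n) if s[i][j] and j != i] for i in range(n)]
    edges = sum(len(a) for a in adj) // 2
    trials = math.ceil(math.log2(2 * n * n))

    def walk(i, j):
        v = i
        for _ in range(4 * (n - 1) * edges):
            if v == j or not adj[v]:
                break
            v = rng.choice(adj[v])
        return v == j

    return [[bool(s[i][j]) or any(walk(i, j) for _ in range(trials))
             for j in range(n)] for i in range(n)]


def weak_transitive_closure(a, rng):
    """Las Vegas loop: S := I v (A v A^T); repeat T := R(S); S := (T v T^T)^2."""
    n = len(a)
    s = [[i == j or bool(a[i][j] or a[j][i]) for j in range(n)] for i in range(n)]
    while True:
        t = random_closure(s, rng)
        sym = [[t[i][j] or t[j][i] for j in range(n)] for i in range(n)]
        s = bool_square(sym)
        if s == t:          # T is reflexive, symmetric, and closed under squaring
            return t
```

The number of loop iterations varies with the random choices, but the returned matrix does not: the loop can only exit once T is reflexive, symmetric, and transitively closed, which pins it to the weak transitive closure.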
Abstract -- This paper considers the problem
of performing garbage collection in a list processing
system in parallel with list updating. One
previously published method for searching an existing
list system and marking all reachable nodes is
outlined and a new method for performing the same
function is described. The two methods are then
compared with respect to both the number of nodes
that need to be visited to complete a marking
phase, and the number of synchronisations that are
needed when there is more than one marker working
in parallel. A number of different list structures
are postulated, and results are presented of the
predicted performance of the two algorithms.


Introduction 


Within list processing systems, nodes are 
repeatedly added to and removed from a number of 
lists. The storage locations in the memory space 
available to the list processing system tend to be 
allocated for use in a particular list and then 
freed. It is clearly desirable to reclaim these 
freed cells for subsequent use, and there are a 
number of techniques whereby this may be accomplished.
The technique that is considered in this
report is Garbage Collection which was first 
proposed by McCarthy [3] and used in the LISP 1.5 
system [4]. 


Using this technique the problem of storage 
reclamation is (often) ignored until the list of 
available cells (free list) becomes empty. When 
this arises, the list processing is temporarily 
suspended and a garbage collection process locates 
cells which have become free and adds them to the 
free list. 


The basic garbage collection algorithm falls 
into four phases:- 


1) Marking phase, in which all accessible nodes
are marked.

2) Relocate phase, in which all accessible nodes
are compacted into a single contiguous area.

3) Update phase, in which all pointers to
relocated nodes are changed.

4) Reclaim phase, in which the inaccessible cells
are collected to form the new free list.


A perfectly satisfactory garbage collection
scheme need only consist of phases 1 and 4, and it
is this scheme that will be considered further in
the remainder of the paper.


Steele [6] suggested that garbage collection
could be performed in parallel with list processing
using two processors, one garbage collecting
and one performing the processing operations.
Under these conditions the user(s) would be
spared the delay which would otherwise occur
when the free list becomes empty. A workable
solution to this problem, which prevented interference
between the two processors, was developed
by Dijkstra et al. [1]. This solution was
extended to incorporate multiple list processors
(mutators) and multiple garbage collectors by
Lamport [2]. Lamport pointed out that interference
between the mutators and the garbage
collectors was potentially high in the marking
phase (phase 1) but that the reclaim phase (phase
4) should involve negligible interference since
the nodes being reclaimed cannot, by definition,
be accessed by any mutator. If all the reclaimed
nodes are gathered into an independent list then
the only possible interaction occurs when this
list is added to the free list, and this is a
single operation.


In this paper, therefore, the reclaim 
phase will be ignored and a new algorithm for 
marking reachable nodes will be developed and 
compared with that devised by Lamport. 


Firstly, the terminology will be introduced, 
and the algorithm adopted by Lamport will be outlined.
The new solution is then presented
together with results showing the performance of 
the two algorithms. 


Definition of Terminology 


The list structure to which consideration
will be given consists of a collection of list
cells (nodes). Each node consists of some
(possibly no) data fields and an ordered sequence
of pointers to other nodes (edges). The node
from which an edge emanates will be called its
source and that to which it points the destination.
Some of the edges are distinguishable as null
edges, that is the edge does not connect two nodes
but acts as a terminator.


If an edge connecting nodes A and B exists 
and B is the destination of the edge then B is 
(one of) the successors of node A and A is a
predecessor of B. Nodes having no successors are 
called terminal nodes (or terminals). 


Some of the nodes, known as root nodes, are 
fixed. A node is said to be reachable (or accessible)
if there is a path to it from a root via
reachable nodes. A non-reachable node is called a 
garbage node. 


Lamport's Algorithm 


Lamport introduces an extra field into the 
nodes for use during the marking phase. This 
field is intended to hold a colour which may be 
one of black, grey or white, and indicates at 
which of the stages of the marking phase the node 
is. 


Operations are introduced to change the
colour of a node to a specific value. Also
introduced is a shading operation which changes a
white node to grey but leaves other colours
unchanged. These operations on a node are required
to be indivisible with respect to the list processing
system (i.e. they must be point operations).
The node space is divided into several (not 
necessarily disjoint) subsets. A marking process 
(marker) is assigned to each of the subsets. No 
details are given as to the method of division, so 
a physical division seems simplest. Initially, 
all nodes are marked white. 


The operation of the marking algorithm 
commences with the roots being shaded. Each 
marker then searches its subset of nodes. When a 
grey node is located by any one of the markers all 
the successors of that node are shaded and the 
original node is coloured black. All the markers 
are then requested to restart the search of their 
portion of the node space. The marking terminates 
when no grey nodes exist, i.e. all reachable nodes 
have been coloured black. The garbage (unreachable)
nodes are those that remain white.
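
The marking procedure just described can be sketched as follows. This is a minimal sequential simulation, not Lamport's own formulation: the markers take turns instead of running concurrently, and the names `lamport_mark`, `shade`, `successors` and `subsets` are assumptions introduced for illustration.

```python
WHITE, GREY, BLACK = "white", "grey", "black"

def shade(colour, node):
    """Turn a white node grey; leave grey and black nodes unchanged."""
    if colour[node] == WHITE:
        colour[node] = GREY

def lamport_mark(successors, roots, subsets):
    """Simulate the markers one after another; restart every scan after a grey node."""
    colour = {node: WHITE for node in successors}   # initially all nodes are white
    for root in roots:
        shade(colour, root)                         # marking starts by shading the roots
    visits = 0
    restarted = True
    while restarted:                                # stop once a full pass finds no grey node
        restarted = False
        for subset in subsets:                      # each marker searches its own subset
            for node in subset:
                visits += 1
                if colour[node] == GREY:
                    for succ in successors[node]:
                        shade(colour, succ)
                    colour[node] = BLACK
                    restarted = True                # all markers restart their searches
                    break
            if restarted:
                break
    garbage = {node for node, c in colour.items() if c == WHITE}
    return colour, garbage, visits
```

Even in this small form the sketch shows the cost discussed next: every restart rescans nodes, including garbage nodes that can never become grey.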


Several comments may be made upon this 
algorithm. Firstly, no attempt is made to use the 
structure of the list within the algorithm itself. 
All reachable nodes may be located by chaining 
down the list structure from the roots. This 
leads to a second point, that all the garbage nodes 
will have to be inspected, possibly several (and 
in some cases many) times. This time is, of 
necessity, "wasted" since a garbage node, by 
definition, cannot become grey. This is an inevit- 
able consequence of dividing the node space into 
physical subsets. 


Further, the synchronisation between the 
markers is non-trivial. The need for one marker, 
on discovering a grey node and shading its 
successors, to cause all others to restart the 
search of their subspace requires a "communication 
path" between every pair of markers. Also, when a 
marker completes the search of its subspace, no 
guarantee can be given that it has completed its 
work as another marker may later discover a grey 
node. Only when all markers have completed 
searching their own subspaces can the marking 
process terminate. This requires each marker to 
monitor the state of all other markers in some way. 
Irrespective of the method that is used to
implement the intercommunication there is bound
to be a considerable waste of processor time.
This may be caused by unnecessary searching of the
list structure, by waiting to be informed whether
to restart the search or by both of these
eventualities. If a marker pre-empts all the others,
forcing them to restart their searches as soon as 
it has marked one node, then the uncompleted 
searches are wasted. If on the other hand all 
markers are allowed to finish their search then 
any marker finding nothing has been wasting time. 
In either case it is necessary to have some way 
for markers to indicate whether they have 
completed or not so that it is possible to 
determine the end of the marking phase. If 
messages between the processors are used then 
every marker must send a message to every other 
marker when it has finished a search without 
finding a shaded node. Although the number of 
messages could be kept to a minimum some are 
bound to be sent unnecessarily. The alternative 
would be to use one shared location to record the 
state of each processor, in which case processors 
would need to loop inspecting the value of the 
locations for the other machines once they had 
finished an unsuccessful search. Any such loop- 
ing would, of course, represent wasted processor 
time. 


Chaining Algorithm 


The algorithm described above was based on 
a physical sub-division of the node space. An 
alternative algorithm is described below which 
marks the reachable nodes by searching down the 
list structure and has hence been given the name 
Chaining Algorithm. 


In order to partition the list space, and 
thus enable several markers to operate, the 
concept of a sublist is introduced. Each marker 
is allocated a section of the total list structure 
and marks the nodes contained in this sublist. 
Once a marker has a sublist, it may proceed 
independently of the other markers (thus reducing 
the synchronisation overheads). However, to 
enable marking to be equitably distributed 
between the markers an additional list is intro- 
duced. 


This list, the subroot list, holds the 
roots of unmarked sublists. Initially, it 
contains the roots of the whole structure. The 
list can be kept short, with possibly one entry 
for each marker since this list represents work 
yet to be allocated to a marker. The colour 
yellow is introduced for a node contained within 
the subroot list, so the roots of the list 
structure are initially coloured yellow. Also, 
the term "uncoloured" is introduced for a node 
which is either white or grey. 


When a marker is started, or whenever it 
has completed the marking of a sublist, it 
removes a node from the subroot list to discover 
the section of the list it is to process. This 
node is shaded. The marker then refills the
subroot list by adding the uncoloured successors
of the subroot it has obtained to the list until 
either the list is filled or only one uncoloured 
successor remains. The nodes added to the subroot 
list are coloured yellow. At all stages in the 
remainder of the algorithm yellow nodes are treated 
as black when encountered by a marker since the 
nodes following are guaranteed to be marked at a 
later stage. 


The remainder of the algorithm, shown in 
outline in Figure 1, is as follows. 


marker =
begin
  while subroot list is not empty do
    remove node from subroot list;
    shade node;
    refill subroot list;
    while subroot is not black do
      while number of uncoloured successors = 1 do
        shade successor;
        colour node black;
        advance to successor setting as subroot
      od;
      while number of uncoloured successors > 0 do
        choose one successor;
        shade successor;
        advance to successor
      od;
      colour current black;
      current := subroot
    od
  od
end;

FIGURE 1: Algorithm for a Marker


The marker maintains two pointers to the sublist 
it is processing, one to the subroot and one to 
the node it is currently inspecting. Both of 
these initially point to the root of the sublist. 
If only one uncoloured successor of the current
node exists then that successor is shaded, the current
node is coloured black and both the subroot and
current pointers are advanced to the successor.
This process is repeated until a node with several
or no uncoloured successors is met. If the current
node has some uncoloured successors then one is
chosen. It is shaded and the current pointer is
advanced to it. This shading and advancing is
repeated until the current node has no uncoloured
successors. When this situation arises, the
current node is coloured black and the current 
pointer is set to the subroot. The whole of this 
procedure is then repeated until the subroot is 
coloured black. When that occurs the sublist for 
which the marker was responsible has been marked 
and a new root is chosen from the subroot list. 
The marker terminates when it cannot obtain a node 
from the subroot list. 
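
As an illustration, the marker of Figure 1 can be written out for a single marker working alone. This is only a sketch under stated assumptions: the structure is loop free, "shading" a node taken from the subroot list is read as setting it grey, and the names `chaining_mark`, `successors` and `capacity` (the maximum length of the subroot list) are hypothetical.

```python
from collections import deque

WHITE, GREY, YELLOW, BLACK = "white", "grey", "yellow", "black"

def chaining_mark(successors, roots, capacity=2):
    """Mark every node reachable from the roots; assumes a loop-free structure."""
    colour = {node: WHITE for node in successors}
    subroots = deque()
    for root in roots:                       # the roots start on the subroot list
        colour[root] = YELLOW
        subroots.append(root)

    def uncoloured(node):                    # "uncoloured" means white or grey
        return [s for s in successors[node] if colour[s] in (WHITE, GREY)]

    while subroots:
        subroot = subroots.popleft()
        colour[subroot] = GREY               # assumed reading of "shade node"
        pending = uncoloured(subroot)
        # Refill until the list is full or only one uncoloured successor remains.
        while len(pending) > 1 and len(subroots) < capacity:
            child = pending.pop()
            colour[child] = YELLOW
            subroots.append(child)
        current = subroot
        while colour[subroot] != BLACK:
            # Exactly one uncoloured successor: advance both pointers.
            while len(uncoloured(current)) == 1:
                successor = uncoloured(current)[0]
                colour[successor] = GREY
                colour[current] = BLACK
                current = subroot = successor
            # Otherwise chain down until no uncoloured successor is left.
            while uncoloured(current):
                successor = uncoloured(current)[0]
                colour[successor] = GREY
                current = successor
            colour[current] = BLACK
            current = subroot
    return colour
```

Nodes that are still white when the function returns are garbage. With several markers the subroot list would be shared and guarded by a semaphore; that synchronisation is deliberately left out of this sketch.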


With a simply connected list structure (that 
is one containing no closed loops and no inter- 
connection between sublists), the algorithm is 
guaranteed to be correct and to terminate. The 
list structure appears as many independent lists 
each with its own marker. Furthermore, the only 
synchronisation required between the markers is 
when accessing the subroot list. The synchronisa- 
tion overhead may be kept to a minimum by allowing 
one marker to be filling the list independently of 
markers which are removing nodes from the list. 

A marker which attempts to remove a node may still 
have to wait either for another marker removing a 
node or if the list is apparently empty because it 
is in the process of being refilled. The markers 
can, however, be prevented from interfering with 
one another during the refilling stage if, when 
one marker is attempting to refill the subroot 
list then further markers are allowed to by-pass 
the refilling stage of the algorithm. 


If the list structure is not simply connected
but the sublists have common nodes (but still 
without loops) then consideration must be given 
to the possible events at the intersection points. 
The simplest possibility to consider is that one 
marker colours the common node yellow or black before 
any other marker accesses that node. When another 
marker reaches that node, it will proceed no 
further. If the intersection node is white or 
grey then the structure beyond the node needs to 
be inspected and several markers may attempt to 
colour the sublist. This will have the same 
effect as several passes down the branch by a 
single marker, that is, the several markers will 
jointly colour the nodes below the intersection 
point. 


If two markers attempt to update the colour 
of the intersection node simultaneously, then one 
must complete its update after the other. The 
node then becomes that colour. Whichever colour 
is finally given to the node, it is valid for at
least one of the markers, and this marker will 
complete the colouring. 


However, with the algorithm as described, a 
list structure containing cycles (closed loops 
between edges) may cause a marker to permanently 
loop. To overcome this, some intelligence may be 
given to the markers. If, while chaining down 
through the successors, the marker visits an 
excessive number (e.g. more than the maximum 
height of the structure or more than the total 
number) of nodes without reaching a terminal (or 
a yellow or black node), then it may assume that 
a loop exists and arbitrarily colour the current 
node yellow and add it to the subroot list. In
this way, a terminating condition is placed within
the loop. Loops will then only reduce the 
efficiency of the algorithm due to wastage in 
identifying them. 


Comparison of the Marking Algorithms 


Empirical testing of the algorithms was 
carried out using a simulated multiprocessing 
system. The algorithms were used on a number of 
types of list structure. Four types of structure 
were chosen to exercise the algorithms under a 
variety of conditions. The types were:- 


a) Linear List 
b) Curtain 


This structure consists of many linear 
lists emanating from a single root. 


c) Highly Interconnected 


In this structure, each node has many 
branches with a large number of nodes
being shared between sublists. Two 
versions of each structure were gener- 
ated, the second being the mirror image 
of the first, that is the sublists that 
were placed left to right for a node in 
one version were placed right to left 
in the other. 


d) Random 


The interconnection was generated 
randomly. 


Each of the first three list structures
was used with both a high and a low proportion of
the node space consisting of reachable nodes. All
structures were loop free. Lamport's algorithm
was performed twice, once with the markers search- 
ing from the low addresses to high addresses and 
secondly from high addresses to low. Table 1 
shows some of the results obtained from the 
simulation studies when the node space consisted 
of 100 nodes. 


From the table it can be seen that, with one 
exception, the Chaining Algorithm performs better 
than that of Lamport on each of the values 
tabulated. In most cases, the number of nodes 
visited is vastly reduced (often by a factor of 
50 or more). Also the cost of synchronisation
between the markers is reduced. The overall
improvement obtained from the Chaining Algorithm 
can be observed from the elapsed times given in 
the table. 


The structure with which the Chaining 
Algorithm performs least well is one with high 
interconnectivity. Yet even with this structure 
the synchronisation overheads are minimal. This 
is of great advantage since a synchronisation will 
(in general) be much more expensive than a node 
visit. The first highly interconnected structure 
represents a 'worst' case for the Chaining 
Algorithm as implemented. In order for the
blackening of the nodes from the terminal nodes 
towards the subroots to take place, the sublists 
need to be traversed many times. This is partly 
due to the high interconnection which will yield 
a high degree of overlapping sublists and partly 
due to the greater number of successors which 
each node has. The pathological nature of the 
structure can be seen in the fact that the image 
structure is traversed at about one third of the 
cost. 


The effect of synchronisations is not fully 
revealed in the elapsed times recorded in Table 1, 
since a synchronisation is approximately as 
expensive as a node visit in the simulation 
program, whereas it would probably be substanti- 
ally more expensive in practice. Furthermore, in 
the results for Lamport's algorithm a synchronisa- 
tion does not cause waiting which it would do for 
some processors in some cases in practice. The 
very much higher level of synchronisation in 
Lamport's algorithm for most list structures 
could therefore be assumed to result in a further 
time advantage for the Chaining Algorithm in 
practice. 


One possible improvement to the Chaining
Algorithm would require the introduction of
backward as well as forward pointers in the list.
The algorithm could then move back up the sublist
from a terminal node marking as it goes. This
would save successive searches down the list if
there was a large fan out from the sub-node.
Unfortunately, this structure would require a
much more elaborate algorithm since it would be
possible for an ascending marker to be unable to
find the subnode from which it started due to the
action of mutators changing the sub-tree.
Explicit synchronisation between markers and
mutators might now be required and the possibility
of processing a section of the tree which has been
rendered garbage is also highlighted.


Conclusions 


The chaining algorithm appears to provide a 
substantially faster multiprocessor garbage 
collection system with fewer synchronisations than 
was available previously. This contention is 
being tested in practice by the implementation of 
a simple list processor system on a four processor 
machine with shared memory [5]. The results 
obtained do confirm the predictions. 


Table 1 Simulation Results for Marking Algorithms

Columns: Type, G.N., M; Chaining (Node Vstd, W.P., Time);
Lamport (Node Vstd, Syn, Time).

Rows: Linear List Dense; Linear List Sparse; Curtain Dense;
Curtain Sparse; High Inter-Connect Dense; High Inter-Connect Sparse;
Random One; Random Three.

[Row and column alignment of the entries is not recoverable.
Surviving entries, in source order:]

Curtain Sparse: 84 1; 84
High Inter-Connect Dense: 6:03; 3:05; 42 12:46; 45 4:17
Random One: 40 1; 40 5
1893; 1227
Random Three: 1943; 75 1219
2200 60 3:56
2612 255 4:28
428 21 0:47
1201 100 2:02
782 25 1:24
105 4 1380 125 2:20


KEY

Type       The formation of the list structure.
G.N.       The number of garbage nodes in the structure.
M          The number of markers employed.
Node Vstd  The number of nodes visited during the marking phase.
W.P.       The number of time steps during which a marker was
           waiting on the 'listfront' semaphore.
Time       The elapsed time (in minutes and seconds) for the
           simulation of the marking phase.
Syn        The number of times, in total, that the markers were
           restarted at the beginning of their subspace.


Notes on Table 1

1) The simulated time for a multiprocessor solution represents
the sum of the times taken by the processors. This is necessarily
greater than the simulated time for an optimum uniprocessor
solution. The actual elapsed time would be rather more than one
fifth of the total time.

2) The simulated version of Lamport's algorithm permitted every
marker to finish its search each time. This extra marking on each
pass accounts for the lower elapsed time of some multiprocessor
trials.


Abstract -- Since a file is usually large and
can not reside in primary memory, the response 
time to a query is dominated by the disk access 
time. In order to reduce the disk access time, 
and hence the response time, a file can be stored 
on several independently accessible disks. In this 
paper, we discuss the problem of allocating buck- 
ets in a file among disks such that the maximal 
disk access concurrency can be achieved. We are 
particularly concerned with the disk allocation 
problem for binary Cartesian product files, which 
have been shown to be effective for partial match 
retrieval. A heuristic allocation method is first 
proposed for the cases where the number (m) of 
available disk units is a power of 2. Then it is 
extended to fit the cases where m is not a power 
of 2. The proposed heuristic allocation method 
has a "near" strict optimal (hence optimal) per- 
formance for a partial match query in which the 
number of unspecified attributes is greater than 
a small number (5 or 8). 


1. Introduction 


In an information retrieval system, a basic 
individual unit of information is defined as a 
record and a collection of records is called a file. 
If the number of records in a file is large enough
(cannot reside in primary memory), the whole file 
must be stored on a secondary storage device 
such as a magnetic disk unit. Therefore, we can 
also assume that the file is divided into buckets 
and each time the secondary storage device is 
accessed, a whole bucket is brought into primary 
memory. 


Since the disk access time is considerably 
longer than the instruction execution time and
primary memory access time, the time taken to 
respond to a query can be simply measured in 
terms of distinct disk accesses which must be 
issued. The number of distinct disk accesses that 
must be issued is equal to the number of buckets 
which contain at least one record satisfying the 
query. 

The file design problem for a particular type 
of queries can, therefore, be stated as follows : 
Given a file (a set of records), arrange records 
into buckets in such a way that the average 
number of buckets to be examined, over all 
concerned queries, is minimized. 


The response time to a given query can be 
further reduced if the file is stored on several 
independently accessible disks. Several buckets 
can be accessed at one disk access time, if they 
are on different disks and only one bucket can be 
accessed at a time if they are on the same disk. 
The response time to a given query in this case is
no longer proportional to the total number of 
buckets needed to be examined, but becomes 
proportional to the maximum number of buckets 
needed to be examined on a particular disk. 
Given a file designed primarily for some type of 
queries and an m-disk (m>1) system, in order to 
reduce the average response time, it is necessary 
to arrange all buckets into m disks in such a way 
that the maximal possible disk access con- 
currency is achieved when examining the 
required buckets. 


In this paper, we discuss the problem of allo- 
cating all buckets in a file, which is designed for 
partial match retrieval, to m disks. Particularly 
we concentrate on the allocation problem for 
binary Cartesian product files. In section 2, the 
relations between partial match queries and 
Cartesian product files are discussed. The exist- 
ing Disk Modulo (DM) allocation method which has 
been shown to be effective for Cartesian product 
files is reviewed in section 3 and its performance 
is shown to be poor for binary Cartesian product 
files. In section 4, we first propose a heuristic 
allocation method for binary Cartesian product 
files when the number (m) of available disk units 
is a power of 2. Then it is extended to fit the
cases where m is not a power of 2. The perfor- 
mance of the proposed allocation method under 
various conditions is compared with that of an 
"ideal" strict optimal and Disk Modulo allocation 
methods in section 5. 


2. Partial Match Queries and Cartesian Product 
Files 


A record may consist of a single attribute or
multiple attributes. A k-attribute record r is an
ordered k-tuple $(r_1, r_2, \dots, r_k)$, each $r_i \in D_i$ for
$1 \le i \le k$, where $D_i$ is the domain of the i-th attribute.
A partial match query q for a k-attribute
file is of the form $(A_1 = a_1, A_2 = a_2, \dots, A_k = a_k)$,
where for $1 \le i \le k$, $a_i$ is either a key belonging to
$D_i$, the domain of the i-th attribute, or is
unspecified (i.e., a don't care condition), in which
case it is denoted by * and where the number of
unspecified attributes is $j$ where $1 \le j \le k-1$.


A response to query q is a list of all records
in which the i-th attribute is equal to $a_i$ if $a_i$ is
not a *. We also say "a record r satisfies a given
query q" if and only if record r is in the response
list of query q. Recently much attention has been
paid to the multi-attribute file design problem for
partial match queries ([1]-[7] and [10]-[13]). In
this paper we shall also limit ourselves to
multi-attribute files.
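
The definition of a response can be made concrete with a short sketch. The representation is an assumption made for illustration: records and queries are tuples, the string "*" stands for an unspecified attribute, and the names `satisfies` and `response` are hypothetical.

```python
from itertools import product

STAR = "*"  # marks an unspecified (don't care) attribute

def satisfies(record, query):
    """True if the record agrees with the query on every specified attribute."""
    return all(a == STAR or r == a for r, a in zip(record, query))

def response(records, query):
    """The response to a query: all records that satisfy it."""
    return [r for r in records if satisfies(r, query)]

# All 2-attribute records over the domain {a, b, c, d}.
records = list(product("abcd", repeat=2))
matches = response(records, ("a", STAR))
```

Here `matches` holds the four records whose first attribute is "a", whatever the second attribute is.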


In the following we are going to define the 
Cartesian product file concept which has been 
shown to be effective for partial match queries 
[5]. A file F is called a Cartesian product file if it
satisfies the definition below.


Definition : Let $D_i$ denote the i-th attribute
domain of a k-attribute file and let each $D_i$ be
partitioned into $m_i$ disjoint subsets
$D_{i0}, D_{i1}, \dots, D_{i(m_i-1)}$. We call F a Cartesian
product file if all records in every bucket are in
$D_{1i_1} \times D_{2i_2} \times \dots \times D_{ki_k}$, where each $D_{ji_j}$ is one of the
subsets $D_{j0}, D_{j1}, \dots, D_{j(m_j-1)}$. The bucket
$b \subseteq D_{1i_1} \times D_{2i_2} \times \dots \times D_{ki_k}$ is denoted by $[i_1, i_2, \dots, i_k]$.

As an example, let $D_1 = D_2 = \{a,b,c,d\}$,
$D_{10} = D_{20} = \{a,b\}$ and $D_{11} = D_{21} = \{c,d\}$. Then the
following is a Cartesian product file :
Bucket [0,0] = $D_{10} \times D_{20}$ = {(a,a),(a,b),(b,a),(b,b)}
Bucket [0,1] = $D_{10} \times D_{21}$ = {(a,c),(a,d),(b,c),(b,d)}
Bucket [1,0] = $D_{11} \times D_{20}$ = {(c,a),(c,b),(d,a),(d,b)}
Bucket [1,1] = $D_{11} \times D_{21}$ = {(c,c),(c,d),(d,c),(d,d)}
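
The example above can be reproduced mechanically. The sketch below builds every bucket from the partitions of the attribute domains; the function name `cartesian_product_file` and the list-of-sets representation of the partitions are assumptions for illustration.

```python
from itertools import product

def cartesian_product_file(partitions):
    """Map each bucket index [i_1, ..., i_k] to the set of records it may hold.

    partitions[i] lists the disjoint subsets of the (i+1)-th attribute domain.
    """
    buckets = {}
    for index in product(*(range(len(p)) for p in partitions)):
        subsets = [partitions[i][index[i]] for i in range(len(partitions))]
        buckets[index] = set(product(*subsets))
    return buckets

# D_1 = D_2 = {a,b,c,d}, each split into {a,b} and {c,d}.
partitions = [[{"a", "b"}, {"c", "d"}], [{"a", "b"}, {"c", "d"}]]
buckets = cartesian_product_file(partitions)
```

The dictionary has four entries, one per bucket, and together they cover all sixteen possible records.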

Although in the above example all possible
records are included, it should be noticed that a
Cartesian product file as defined above is a subset
of $D_1 \times D_2 \times \dots \times D_k$ and does not necessarily contain
all possible records. The Cartesian product file
concept also satisfies a common property of all
"good" file design methods described by Lin, Lee
and Du [10]: that is, records in one bucket are
similar to one another. The Cartesian product file
concept is a simple and natural way to cluster
similar records into the same bucket.


Many good file systems such as those
designed by Rivest [12], Rothnie and Lozano [13],
as well as Liou and Yao [11], are all Cartesian
product files. Aho and Ullman [1] explored the file
design problem of partial match queries with the
assumption that each attribute in a partial match
query has a probability of being specified, and
their file structure is also a Cartesian product
file. The main difference among these file
systems is the way each attribute domain $D_i$ is
partitioned. In [5] many properties of Cartesian
product files were discussed.


3. Disk Modulo Allocation Method 


Since Cartesian product files have been
shown to be effective for partial match queries
and have been widely (though sometimes implicitly)
described in the literature, it is
worth considering the bucket allocation problem
for Cartesian product files. Du and Sobolewski [9]
proposed a Disk Modulo allocation method for
Cartesian product files.


Definition : Let file $F \subseteq D_1 \times D_2 \times \dots \times D_k$ be a Cartesian
product file, where each $D_i$ is partitioned
into $m_i$ disjoint subsets $D_{i0}, D_{i1}, \dots, D_{i(m_i-1)}$, and
$m$ be the number of available disk units (labeled
as $0, 1, \dots, m-1$). Let $[i_1, i_2, \dots, i_k]$ denote the
bucket $F \cap (D_{1i_1} \times D_{2i_2} \times \dots \times D_{ki_k})$, where $0 \le i_j \le m_j - 1$
for $1 \le j \le k$. In the Disk Modulo (DM) allocation
method each bucket $[i_1, i_2, \dots, i_k]$ in file F is
assigned to disk unit $(i_1 + i_2 + \dots + i_k) \bmod m$.
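
The Disk Modulo rule is a single line of arithmetic. A minimal sketch, assuming a bucket is represented as a tuple of subset indices (the name `disk_modulo` is hypothetical):

```python
def disk_modulo(bucket, m):
    """Disk unit for bucket [i_1, ..., i_k] under DM: (i_1 + ... + i_k) mod m."""
    return sum(bucket) % m

# Bucket [1, 0, 1] on a 3-disk system goes to disk (1 + 0 + 1) mod 3 = 2.
disk = disk_modulo((1, 0, 1), 3)
```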
Given a file with n buckets, there are $m^n$
ways to allocate n buckets into m disks. The
problem of finding an optimal allocation method
is very difficult. Du and Sobolewski, therefore,
compared the Disk Modulo allocation method with
an ideal "strict optimal" allocation method.


Definition : An allocation method is said to be
strict optimal to a query if a maximum of $\lceil n/m \rceil$
buckets need to be accessed on any one of m
independently accessible disks in order to examine
the n buckets in response to the query. If an
allocation method is strict optimal for all possible
queries, it is called a "strict optimal" allocation
method.
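
The definition can be turned into a direct check for a single query. This sketch reads the per-disk bound as the ceiling of n/m, which is an assumption made because n/m need not be an integer; the names `response_buckets` and `is_strict_optimal`, and the use of "*" for an unspecified attribute, are hypothetical.

```python
from collections import Counter
from itertools import product
from math import ceil

def response_buckets(query, sizes):
    """Buckets to examine: a specified entry fixes the index, '*' ranges over all."""
    ranges = [range(m_i) if b == "*" else [b] for b, m_i in zip(query, sizes)]
    return list(product(*ranges))

def is_strict_optimal(allocate, query, sizes, m):
    """True if no disk holds more than ceil(n/m) of the n buckets in the response."""
    buckets = response_buckets(query, sizes)
    load = Counter(allocate(b) for b in buckets)
    return max(load.values()) <= ceil(len(buckets) / m)

# Disk Modulo on a file with m_1 = m_2 = m_3 = 2.
sizes = (2, 2, 2)
ok_two_disks = is_strict_optimal(lambda b: sum(b) % 2, ("*", "*", 0), sizes, 2)
ok_eight_disks = is_strict_optimal(lambda b: sum(b) % 8, ("*", "*", "*"), sizes, 8)
```

Passing the allocation as a function keeps the check independent of any particular allocation method.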


Note that a strict optimal allocation method 
is optimal. The Disk Modulo allocation method 
has been shown to be strict optimal for the fol- 
lowing cases [9]: 

1) all partial match queries with only one
unspecified attribute,

2) all partial match queries with at least one
unspecified attribute j for which $m_j \bmod m = 0$,

3) all possible partial match queries when $m_i \bmod m = 0$
or $m_i = 1$ for all $1 \le i \le k$,

4) all possible partial match queries when $m = 2$ or
3.


More properties of the Disk Modulo allocation
method can be found in [9]. Although the Disk
Modulo allocation is strict optimal for the above
cases, the following example shows that it is not
strict optimal (or even optimal) in general.
However, Du also proved that the Disk Modulo
allocation method is asymptotically strict optimal if all
$m_i$'s increase to infinity [8].


Example 3.1: Table 3.1 shows the distribution of
assigning all buckets among m disks using the DM
method for a Cartesian product file F in which
$m_1 = m_2 = m_3 = 2$ and $m > 7$. As can be seen, the
distribution is not uniform and, in fact, disks 4 to $m-1$
are never used. The obvious optimal allocation
method in this case is to assign each bucket to
a different disk.
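
The uneven distribution in Example 3.1 is easy to recompute. The sketch below takes m = 8 as one concrete value satisfying m > 7 (an assumption for illustration) and counts how many of the eight buckets the Disk Modulo rule sends to each disk.

```python
from collections import Counter
from itertools import product

m = 8  # any m > 7 behaves the same way; 8 is chosen for illustration

# m_1 = m_2 = m_3 = 2, so every bucket index is a triple of 0s and 1s.
load = Counter(sum(bucket) % m for bucket in product(range(2), repeat=3))
per_disk = [load.get(disk, 0) for disk in range(m)]
```

`per_disk` comes out as `[1, 3, 3, 1, 0, 0, 0, 0]`: the index sum never exceeds 3, so only disks 0 to 3 receive buckets, while eight disks could hold one bucket each.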


Generally speaking, the Disk Modulo allocation
method has a good performance (close to
strict optimal) when each $m_i$ is either 1 or a large
number. On the other hand, its worst case occurs
when each $m_i$ is small but not equal to 1 (one
such case is $m_i = 2$ for all $i$). The above
conclusion is supported by the following facts :

1) The Disk Modulo allocation is asymptotically
strict optimal when all $m_i$'s increase to infinity.

2) Let $F \subseteq D_1 \times D_2 \times \dots \times D_k$ be a Cartesian product file
with $D_i$ partitioned into $m_i$ disjoint subsets and
$F' \subseteq D'_1 \times D'_2 \times \dots \times D'_k$ be a similar file but with a
greater number of records (and therefore buckets)
in which $D'_i$ is partitioned into $m'_i$ disjoint
subsets where $m'_i > m_i$, and $m_i \bmod m = m'_i \bmod m$
for $1 \le i \le k$. Du and Sobolewski showed that the
performance of the Disk Modulo method for the
file F' is better than (closer to strict optimal)
that for the file F [9].

3) From Table 3.1, we know the Disk Modulo
allocation method has a relatively poor
performance for large $m$ and small $m_i$'s.


4. A Heuristic Allocation Method for Binary 
Cartesian Product Files 


In a Cartesian product file F, if each attribute
domain $D_i$ contains only two elements, then file F
is a binary Cartesian product file. Since there
are only two elements in each attribute domain,
each attribute domain can be partitioned into
either 1 or 2 disjoint subsets (i.e., $m_i = 1$ or 2).
Therefore, a binary Cartesian product file can be
characterized by the following two properties :

1) The number of attributes (k) is usually very
large.

2) The number of disjoint subsets partitioned
from each $D_i$ ($m_i$) is either 1 or 2.


In this section we study the allocation problem
for binary Cartesian product files for the
following reasons :

1) Binary files are important. Any record type can 
be encoded as a binary string and it was pointed 
out by Rivest [12] that binary files seem to be the 
hardest type for which to design an “optimal” file 
structure, since the number of attributes is usu- 
ally large and the user has the greatest flexibility 
in specifying queries. 

2) Several papers ([3], [4] and [12]) concerning 
the file design problem for partial match queries 
are concentrated on binary files and certain 
types of binary Cartesian product files have been 
shown to be "good" file structures for partial 
match queries. 

3) Unfortunately, the Disk Modulo allocation method
has a relatively poor performance for binary
Cartesian product files.


In the rest of this section we first consider 
the cases where m is a power of 2 and a heuristic 
allocation method which has a better perfor- 
mance for binary Cartesian product files is pro- 
posed. 


Definition : Let F be a k-attribute binary Cartesian
product file (therefore, $m_i = 1$ or 2 for
$1 \le i \le k$), and let $m$ be the number of available
disk units, where $m$ is a power of 2. Let
$T = \{j_1, j_2, \dots, j_h\}$ be the set of all attributes $i$ with
$m_i = 2$. For convenience and without loss of
generality, we shall assume that $j_i = i$ for $1 \le i \le h$. A
heuristic allocation method (HEU) is defined as
follows :

Bucket $[i_1, i_2, \dots, i_k]$ (note $i_j = 0$ or 1 for $1 \le j \le k$) is
assigned to disk unit $\left(\sum_{j=1}^{k} i_j p_j\right) \bmod m$, where
$p_j = 2^{(j \bmod \log_2 m)}$ for $1 \le j \le h$ and $p_j = 1$ for
$h+1 \le j \le k$.


Example 4.1 Let F be a binary Cartesian product
file with $m_i = 2$ for $1 \le i \le 4$ and $m_5 = 1$, and $m = 4$.
Then $\log_2 m = 2$, $p_1 = 2^{(1 \bmod 2)} = 2$,
$p_2 = 2^{(2 \bmod 2)} = 1$, $p_3 = 2^{(3 \bmod 2)} = 2$,
$p_4 = 2^{(4 \bmod 2)} = 1$, and $p_5 = 1$ (since $m_5 = 1$).
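
The weights of Example 4.1 and the resulting bucket distribution can be recomputed with a short sketch. It assumes, as the definition does, that m is a power of 2 (and at least 2) and that the attributes with two subsets are numbered first; the names `heu_weights` and `heu_disk` are hypothetical.

```python
from collections import Counter
from itertools import product

def heu_weights(sizes, m):
    """Weights p_j for attributes 1..k; sizes[j-1] is m_j (1 or 2), m a power of 2."""
    log_m = m.bit_length() - 1               # log2(m) for a power of 2
    return [2 ** (j % log_m) if m_j == 2 else 1
            for j, m_j in enumerate(sizes, start=1)]

def heu_disk(bucket, weights, m):
    """Disk unit for bucket [i_1, ..., i_k]: (sum of i_j * p_j) mod m."""
    return sum(i * p for i, p in zip(bucket, weights)) % m

sizes, m = (2, 2, 2, 2, 1), 4                # the file of Example 4.1
weights = heu_weights(sizes, m)
all_buckets = product(*(range(m_j) for m_j in sizes))
load = Counter(heu_disk(b, weights, m) for b in all_buckets)
```

`weights` is `[2, 1, 2, 1, 1]`, matching the example, and `load` shows four of the sixteen buckets on each of the four disks.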


Table 4.1 shows the distribution of all buckets
in the above example. Note that all 16 buckets
are uniformly distributed among 4 disks. Let $S$ be
the set of all attributes $j$ with $m_j = 1$ and $S_i$, for
$0 \le i < \log_2 m$, be the set of all attributes $j$ with
$i = j \bmod \log_2 m$ and $j \in T$ (i.e., $m_j = 2$). In the previous
example $S = \{5\}$, $S_0 = \{2,4\}$ and $S_1 = \{1,3\}$. By the
definitions of $S$ and $S_i$ for $0 \le i < \log_2 m$, it is not
hard to see that in the heuristic allocation
method $p_j = 1$ for each $j \in S$ and $p_j = 2^i$ for each
$j \in S_i$, where $0 \le i < \log_2 m$.

There are two interesting properties depicted
in Example 4.1:

1) For each bucket $[i_1, i_2, \dots, i_k]$, $0 \le i_j < m_j$ for
$1 \le j \le k$. Therefore, $i_j = 0$ for each $j \in S$ (since
$m_j = 1$) and $\left(\sum_{j=1}^{k} i_j p_j\right) \bmod m = \left(\sum_{j \in T} i_j p_j\right) \bmod m$.
That means the value of $p_j$ for each $j \in S$
is of no importance. Since $m_j = 1$ for each $j \in S$, the
number of buckets needed to be examined to
respond to a query will be the same no matter
whether the j-th attribute is specified or not.
Furthermore, let $F \subseteq D_1 \times D_2 \times \dots \times D_k$ be a binary k-attribute
Cartesian product file in which each $D_i$ is partitioned
into $m_i$ subsets for $1 \le i \le k$ with $m_j = 1$ for some $j$.
For simplicity, let us assume $j = 1$. Let
$F' \subseteq D_2 \times D_3 \times \dots \times D_k$ be a (k-1)-attribute Cartesian
product file in which each $D_i$ is partitioned into $m_i$
subsets for $2 \le i \le k$. If bucket $[i_1, i_2, \dots, i_k]$ in file F is
assigned to disk d by the proposed allocation
method, then bucket $[i_2, \dots, i_k]$ in file F' is also
assigned to disk d (since $m_1 = 1$ and $i_1 = 0$). Given
a query $q = (A_1 = a_1, A_2 = a_2, \dots, A_k = a_k)$ in file F, there
is a query $q' = (A_2 = a_2, \dots, A_k = a_k)$ in file F'
corresponding to query q and the responses to
query q and q' contain the same number of buckets.
If bucket $[i_1, i_2, \dots, i_k]$ in file F is in the
response to query q, then bucket $[i_2, \dots, i_k]$ in file
F' is in the response to query q'. Thus, from the
performance point of view there is not much
difference between file F and F'. For simplicity in
the rest of this paper we are going to assume that
$S$ is empty (i.e., $m_i = 2$ for all $i$) for each binary
Cartesian product file.

2) For a given partial match query which contains
at least two unspecified attributes, one belonging to
$S_0$ and the other belonging to $S_1$, the heuristic
allocation method is strict optimal for the query.
For instance, in Example 4.1 queries
$q = (A_1 = *, A_2 = *, A_3 = a_3, A_4 = a_4, A_5 = a_5)$, where $a_i \in D_i$
for $3 \le i \le 5$, and queries $q' = (A_1 = *, A_2 = a_2,
A_3 = *, A_4 = *, A_5 = a_5)$, where $a_2 \in D_2$ and $a_5 \in D_5$, are
strict optimal. The readers can verify this by
themselves.


Before giving a theorem which shows that the
heuristic allocation method is strict optimal
under certain conditions, let us define some
notation first. Let $\langle b_1, b_2, \dots, b_k \rangle$ denote the set of
buckets $[i_1, i_2, \dots, i_k]$ with $i_j = b_j$ if $b_j$ is not a * or
$0 \le i_j \le m_j - 1$ if $b_j$ is *, where $b_j$ is either * or
$0 \le b_j \le m_j - 1$ for $1 \le j \le k$. For example $\langle *,0,1,* \rangle$ =
{[0,0,1,0], [0,0,1,1], [1,0,1,0], [1,0,1,1]} if
$m_1 = m_4 = 2$. $\langle b_1, b_2, \dots, b_k \rangle$ is also the response
list of a partial match query $q = (A_1 = a_1, A_2 = a_2, \dots,
A_k = a_k)$, where $a_i = *$ if $b_i = *$ or $a_i \in D_{ib_i}$ if $b_i$ is
not a * for $1 \le i \le k$. It is not hard to see

$\langle b_1, \dots, b_{j-1}, *, b_{j+1}, \dots, b_k \rangle = \bigcup_{0 \le i \le m_j - 1} \langle b_1, \dots, b_{j-1}, i, b_{j+1}, \dots, b_k \rangle$.

Let $G = \{ \langle b_1, b_2, \dots, b_k \rangle \mid
\langle b_1, b_2, \dots, b_k \rangle$ has exactly $\log_2 m$
unspecified attributes, one from each $S_i$, where
$0 \le i < \log_2 m \}$. The following lemma shows that all
buckets in each element of G are "uniformly" distributed among
m disks.


Lemma 4.1 Assume that $m = 2^h$ for some non-negative integer h
and $\langle b_1, b_2, \dots, b_k \rangle \in G$. All buckets in
$\langle b_1, b_2, \dots, b_k \rangle$ are "uniformly" distributed
among m disks, one in each disk, by the proposed heuristic
allocation method. (Due to the space limitation, all proofs in
this paper are omitted.)


For instance, <*,*,0,0,0> is the response list to a query
$q = (A_1 = *, A_2 = *, A_3 = a_3, A_4 = a_4, A_5 = a_5)$, where
$a_i \in D_{i0}$ for $3 \le i \le 5$, in Example 4.1. Since
$m = 2^2 = 4$, $1 \in S_1$ and $2 \in S_0$, we have
<*,*,0,0,0> $\in G$. The four buckets [0,0,0,0,0], [0,1,0,0,0],
[1,0,0,0,0] and [1,1,0,0,0] in <*,*,0,0,0> are assigned to disks
0, 1, 2 and 3 respectively.


Theorem 4.1 Let $m = 2^h$ and let q be a partial match query
containing at least one unspecified attribute from each $S_i$ for
$0 \le i < \log_2 m = h$. Then the heuristic allocation method is
strict optimal for query q.

Assume $m = 2^h$. By definition of the heuristic allocation
method, if $m_i = 2$ for $1 \le i \le k$ and k is a multiple of h,
then there are exactly k/h attributes belonging to each $S_i$ for
$0 \le i < h$.


Corollary 4.1 Let $m_i = 2$ for $1 \le i \le k$ and $m = 2^h$,
and let k be a multiple of h. The heuristic allocation method is
strict optimal for all partial match queries which contain more
than $(h-1) \cdot k/h$ unspecified attributes.


The above theorem and corollary are quite useful in practice.
Consider the special case $m = 2^2 = 4$ and $k = 20$ (there are
about a million possible records in the file). A query usually
has fewer than 10 attributes specified (most likely no more
than 5), so it has more than $(h-1) \cdot k/h = 10$ unspecified
attributes. Thus by Corollary 4.1 the heuristic allocation method
is strict optimal for such a query.

In an information retrieval system, some attributes usually are
never specified in a query, or have very little chance of being
specified. If there are $\log_2 m = h$ attributes that are never
specified, then we can assign one such attribute to each $S_i$
for $0 \le i < h$, and the heuristic allocation method becomes
strict optimal for all possible partial match queries (with those
attributes unspecified). Even if the number of such attributes is
less than $\log_2 m$, the performance of the heuristic allocation
method can be improved by assigning one such attribute to each of
as many $S_i$'s as possible and assigning the rest of the
attributes to the $S_i$'s which contain no such attribute.


In the above allocation method, bucket $[i_1, i_2, \dots, i_k]$ is
assigned to disk $(\sum_{j=1}^{k} i_j \cdot p_j) \bmod m$, where
$p_j = 2^{(j \bmod \log_2 m)}$, if $m_j = 2$ for $1 \le j \le k$.

We can extend the above allocation method by replacing
$p_j = 2^{(j \bmod \log_2 m)}$ with either
$p_j = 2^{(j \bmod \lfloor \log_2 m \rfloor)}$ or
$p_j = 2^{(j \bmod \lceil \log_2 m \rceil)}$ to fit the cases
where m is not a power of 2. The extended allocation method has
almost the same performance as the original one. This will be
shown in the next section.
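
The assignment rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `weight`, `heuristic_disk` and `disk_modulo` and the `variant` parameter are hypothetical, attributes are numbered from 1 as in the text, every $m_j$ is assumed to be 2, and m is assumed to be at least 2. Disk Modulo is taken to be the plain sum of the bucket indices modulo m.

```python
def weight(j, m, variant="floor"):
    """Return p_j = 2**(j mod L) for the 1-based attribute index j.

    L is floor(log2 m) for variant "floor" and ceil(log2 m) for
    variant "ceil"; the two coincide when m is a power of 2.
    Assumes m >= 2 so that L >= 1.
    """
    if variant == "floor":
        L = m.bit_length() - 1       # floor(log2 m)
    else:
        L = (m - 1).bit_length()     # ceil(log2 m)
    return 2 ** (j % L)


def heuristic_disk(bucket, m, variant="floor"):
    """Disk of bucket [i_1, ..., i_k]: (sum of i_j * p_j) mod m."""
    return sum(i * weight(j, m, variant)
               for j, i in enumerate(bucket, start=1)) % m


def disk_modulo(bucket, m):
    """Disk Modulo allocation: (sum of i_j) mod m."""
    return sum(bucket) % m
```

With m = 4 this gives $p_1 = 2$ and $p_2 = 1$, so the four buckets of <*,*,0,0,0> land on four different disks, whereas Disk Modulo places two of them on disk 1.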


5. Analysis and Comparisons 


In this section the performance of the proposed allocation method
under various conditions is compared with that of the Disk Modulo
allocation method and of an "ideal" strict optimal allocation
method.

Let $F \subseteq D_1 \times D_2 \times \dots \times D_k$ be a
binary Cartesian product file. We shall still assume that
$m_i = 2$ for $1 \le i \le k$. Therefore, each $D_i$ is
partitioned into 2 single-element subsets $D_{i0}$ and $D_{i1}$.
The number of queries with j unspecified attributes equals
$C(k,j) \cdot 2^{k-j}$, where $C(k,j)$ is the number of ways to
choose j elements from a pool of k elements.

Let $q = (A_1 = a_1, \dots, A_k = a_k)$ be a partial match query
with j unspecified attributes $i_1, i_2, \dots, i_j$ and let
$\langle c_1, \dots, c_k \rangle$ be the response to query q,
where $c_i = a_i = *$ if $i \in \{i_1, i_2, \dots, i_j\}$, or
$a_i \in D_{i c_i}$ if $a_i \ne *$. Since each unspecified
attribute can take two values, there are $2^j$ buckets in
$\langle c_1, c_2, \dots, c_k \rangle$. Let
$A_K(q) = (n_0, n_1, \dots, n_{m-1})$ be an m-tuple in which
$n_i$ denotes the number of buckets in
$\langle c_1, c_2, \dots, c_k \rangle$ which are assigned to disk
i by allocation method K. Thus, $A_{DM}(q)$ and $A_{HEU}(q)$
denote the distributions of all buckets in the response to query
q among m disks when the Disk Modulo and the proposed heuristic
allocation methods are applied respectively.

Let $N = (n_0, n_1, \dots, n_{m-1})$ be an m-tuple with each
$n_i$ a non-negative integer. We define $N^{(i)}$ to be the
m-tuple formed by a right circular shift of i positions of all m
components of N. For example, if $N = (1,2,3,4)$, then
$N^{(1)} = (4,1,2,3)$, $N^{(2)} = (3,4,1,2)$,
$N^{(3)} = (2,3,4,1)$ and $N^{(4)} = (1,2,3,4) = N$. Let us also
define m functions $f_i(N) = N + N^{(i)}$ for $1 \le i \le m$.
Also, for convenience, let $f_{i_1, i_2, \dots, i_g}(N)$ denote
$f_{i_1}(f_{i_2}(\dots(f_{i_g}(N))\dots))$. The following theorem
shows how to compute $A_{DM}(q)$ and $A_{HEU}(q)$ for a given
query q.


Theorem 5.1 Let $q = (A_1 = a_1, A_2 = a_2, \dots, A_k = a_k)$ be
a partial match query with j unspecified attributes
$i_1, i_2, \dots, i_j$ and let
$\langle c_1, c_2, \dots, c_k \rangle$ be the response to query q.
Let $t = (\sum_{c_i \ne *} c_i) \bmod m$ and
$t' = (\sum_{c_i \ne *} c_i \cdot p_i) \bmod m$, where
$p_i = 2^r$ if $i \in S_r$ as defined in the heuristic allocation
method. Let N and N' be m-tuples with all components 0 except the
t-th and t'-th component, respectively, which is 1. Then
$A_{HEU}(q) = f_{p_{i_1}, p_{i_2}, \dots, p_{i_j}}(N')$ and
$A_{DM}(q) = f_{1, 1, \dots, 1}(N)$, where there are j 1's in the
latter expression.

For example, let $m_i = 2$ for $1 \le i \le 5$ and $m = 4$. Since
$S_0 = \{2,4\}$ and $S_1 = \{1,3,5\}$ (recall the definitions of
$S_0$ and $S_1$ in the previous section), $p_1 = p_3 = p_5 = 2$
and $p_2 = p_4 = 1$. Let
$q = (A_1 = *, A_2 = *, A_3 = *, A_4 = a_4, A_5 = a_5)$, where
$a_4 \in D_{40}$ and $a_5 \in D_{51}$. Then $t = 1$, $t' = 2$,
$A_{DM}(q) = f_{1,1,1}(N = (1,0,0,0)) = (1,3,3,1)$ and
$A_{HEU}(q) = f_{2,1,2}(N' = (0,1,0,0)) = (2,2,2,2)$.
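
The shift-and-add functions used in this example can be written out directly. The sketch below is only an illustration of the definitions of $N^{(i)}$ and $f_i$; the helper names `shift_right`, `f` and `compose` are hypothetical, and the starting tuples are taken verbatim from the example rather than derived from t and t'.

```python
def shift_right(N, i):
    """N^(i): right circular shift of the tuple N by i positions."""
    i %= len(N)
    return N[-i:] + N[:-i] if i else tuple(N)


def f(i, N):
    """f_i(N) = N + N^(i), added component by component."""
    return tuple(a + b for a, b in zip(N, shift_right(N, i)))


def compose(indices, N):
    """f_{i1,...,ig}(N) = f_{i1}(f_{i2}(...f_{ig}(N)...)).

    The innermost function carries the last index, so the indices
    are applied in reverse order.
    """
    for i in reversed(indices):
        N = f(i, N)
    return N


dm = compose([1, 1, 1], (1, 0, 0, 0))    # Disk Modulo distribution
heu = compose([2, 1, 2], (0, 1, 0, 0))   # heuristic distribution
```

Here `dm` is (1, 3, 3, 1) and `heu` is (2, 2, 2, 2), so the largest number of buckets on one disk is 3 under Disk Modulo and 2 under the heuristic method for this query.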


Let $A_K(q) = (n_0, n_1, \dots, n_{m-1})$, where $n_i$ denotes the
number of buckets, among those that need to be examined to
respond to query q, allocated to disk i by allocation method K.
Thus, the time required to respond to query q is
$\max\{n_0, n_1, \dots, n_{m-1}\}$.

Now let us consider the performance of the proposed heuristic
allocation method under various conditions. First assume
$m = 2^h$. Given a partial match query q, from Theorem 4.1, if
there exists at least one unspecified attribute from each $S_i$
for $0 \le i < h$, then the proposed allocation method is strict
optimal for query q. Let k be a multiple of h. Thus, the number
(b) of elements in each $S_i$ for $0 \le i < h$ is the same
(i.e., $b = k/h$).


Assume that there are j unspecified attributes in a query q. Let
Prob(i,j) denote the probability of having all j unspecified
attributes in i out of the h sets $S_r$, where $0 \le r < h$.

Then
$$\mathrm{Prob}(1,j) = \frac{C(h,1) \cdot C(b,j)}{C(k,j)}
  \text{ if } j \le b, \text{ or } 0 \text{ if } j > b.$$

$$\mathrm{Prob}(2,j) = \frac{C(h,2) \cdot C(2b,j)}{C(k,j)}
  - \mathrm{Prob}(1,j)
  \text{ if } j \le 2b, \text{ or } 0 \text{ if } j > 2b.$$

In general
$$\mathrm{Prob}(i,j) = \frac{C(h,i) \cdot C(i \cdot b,j)}{C(k,j)}
  - \sum_{r=1}^{i-1} \mathrm{Prob}(r,j)
  \text{ if } j \le ib, \text{ or } 0 \text{ if } j > ib.$$

Note
$$\sum_{r=1}^{i-1} \mathrm{Prob}(r,j)
  = \frac{C(h,i-1) \cdot C((i-1) \cdot b,j)}{C(k,j)}
  \text{ if } j \le (i-1)b, \text{ or } 0 \text{ if } j > (i-1)b.$$

Thus, the probability for the proposed allocation method to be
strict optimal for query q is
$$\mathrm{Prob}(h,j)
  = 1 - \frac{C(h,h-1) \cdot C((h-1) \cdot b,j)}{C(k,j)}
  \text{ if } j \le (h-1)b, \text{ or } 1 \text{ if } j > (h-1)b.$$


For example, let k=8 and m=4. Given a partial match query q with
4 unspecified attributes, the probability that the proposed
allocation method is strict optimal for query q equals
$\mathrm{Prob}(h=2, j=4) = 1 - (C(2,1) \cdot C(4,4)) / C(8,4)
= 34/35 \approx 0.9714$.
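
The worked value 34/35 can be checked numerically. The sketch below evaluates the closed-form expression for Prob(h,j) given above and, separately, counts by brute force the j-subsets of attributes that touch every $S_r$. It assumes $S_r = \{a : a \bmod h = r\}$ for attributes numbered 1..k, which matches $S_0 = \{2,4\}$ and $S_1 = \{1,3,5\}$ in the earlier example; the function names are hypothetical, and the two computations are compared here only for the h = 2 example.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def prob_closed_form(k, h, j):
    """Prob(h, j) as written in the text, with b = k/h."""
    b = k // h
    if j > (h - 1) * b:
        return Fraction(1)
    return 1 - Fraction(comb(h, h - 1) * comb((h - 1) * b, j), comb(k, j))


def prob_brute_force(k, h, j):
    """Fraction of j-subsets of {1..k} with a member in every S_r."""
    hits = total = 0
    for subset in combinations(range(1, k + 1), j):
        total += 1
        if len({a % h for a in subset}) == h:
            hits += 1
    return Fraction(hits, total)


example = prob_closed_form(8, 2, 4)   # k = 8, m = 4 (h = 2), j = 4
```

For k = 8, h = 2 and j = 4 both functions return 34/35: of the 70 ways to choose the unspecified attributes, only the two that fall entirely inside $S_0$ or entirely inside $S_1$ miss one of the sets.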

Tables 5.1, 5.2 and 5.3 compare the performance of an "ideal"
strict optimal, the proposed heuristic and the Disk Modulo
allocation methods when m is a power of 2. Let SO, HEU and DM
represent a strict optimal, the proposed heuristic and the Disk
Modulo allocation methods respectively. Let $T_K$ and $RE_K$
denote the average response time and the relative efficiency,
respectively, of a partial match query when allocation method K
is employed. $RE_K$ is defined as $(100 \cdot T_{SO}) / T_K$,
which shows how close the performance of allocation method K is
to that of an "ideal" strict optimal allocation method. Note that
it is different from the probability that allocation method K is
strict optimal for a given partial match query. Also note that
the case where all k attributes are unspecified is considered in
Tables 5.1, 5.2 and 5.3, although there is no such partial match
query according to our definition.


In comparing these results, the following can be concluded:

1) The proposed heuristic allocation method performs better than
the Disk Modulo allocation method in all cases.

2) For fixed m and k, the performance of the proposed heuristic
allocation method for a query q with j unspecified attributes
will first degrade as j increases. However, it will improve
rapidly after j is greater than $\log_2 m$. In fact it becomes
strict optimal when the number of unspecified attributes is
greater than $(\log_2 m - 1) \cdot k / \log_2 m$.

3) For a fixed k but different m, the average response time to a
query q with j unspecified attributes is almost the same when the
Disk Modulo allocation method is applied. That means the relative
efficiency of the Disk Modulo allocation method will decrease as
the number of available disk units increases. Fortunately this is
not true for the proposed heuristic allocation method.

4) The proposed allocation method has a "near" strict optimal
performance for a query q in which the number (j) of unspecified
attributes is greater than a small number (5 or 6).


Since the average response time to a query q grows with the
number of unspecified attributes in q, this last point is very
important. For instance, in Table 5.2 (b) $RE_{HEU} = 54.90$ for
j=3 and the difference between $T_{HEU}$ and $T_{SO}$ is less
than one disk access. However, in the same table
$RE_{DM} = 63.64$ for j=15 and the difference between $T_{DM}$
and $T_{SO}$ is more than two thousand disk accesses.


When m is not a power of 2, the proposed allocation method can be
extended by replacing $p_j = 2^{(j \bmod \log_2 m)}$ with either
$p_j = 2^{(j \bmod \lfloor \log_2 m \rfloor)}$ or
$p_j = 2^{(j \bmod \lceil \log_2 m \rceil)}$. Let HEU1 and HEU2
denote the former and the latter modified heuristic allocation
methods respectively. Tables 5.4 and 5.5 compare the performance
of HEU1, a strict optimal and the Disk Modulo allocation methods
for m=5 and 6. Similar comparisons are shown in Tables 5.6 and
5.7 for HEU2 when m=6 and 7. Although the results of the previous
section no longer apply to HEU1 and HEU2, their performance is
still fairly close to strict optimal when j is greater than 5
or 6. Some more comparisons of the performance of a strict
optimal, HEU1, HEU2 and Disk Modulo allocation methods are shown
in Table 5.8. The results in Table 5.8 are derived under the
assumption that each query has an equal probability of being
asked. When m is a power of 2, both the HEU1 and HEU2 allocation
methods become the HEU allocation method. It is not hard to see
that the performance of the HEU1 and HEU2 allocation methods for
the cases where m is not a power of 2 is not inferior to that of
the HEU allocation method for the cases where m is a power of 2.
Although it cannot guarantee the best result, when m is not a
power of 2 the suggested criterion for choosing between the HEU1
and HEU2 allocation methods depends on the difference between
$\log_2 m - \lfloor \log_2 m \rfloor$ and
$\lceil \log_2 m \rceil - \log_2 m$. If
$\log_2 m - \lfloor \log_2 m \rfloor <
\lceil \log_2 m \rceil - \log_2 m$ then choose HEU1; otherwise
choose HEU2.
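
The selection criterion in the last sentence amounts to asking whether $\log_2 m$ is closer to its floor or to its ceiling. A minimal sketch, assuming floating-point logarithms are accurate enough for the disk counts of interest (the function name `choose_variant` is hypothetical):

```python
import math


def choose_variant(m):
    """Pick HEU1 (floor) or HEU2 (ceiling) for a disk count m >= 2.

    When m is a power of 2 both distances are zero and the two
    variants coincide with HEU, so the returned label is immaterial.
    """
    lg = math.log2(m)
    below = lg - math.floor(lg)   # distance from floor(log2 m)
    above = math.ceil(lg) - lg    # distance to ceil(log2 m)
    return "HEU1" if below < above else "HEU2"
```

Under this rule m = 5 selects HEU1, while m = 6 and m = 7 select HEU2.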


6. Summary 


If a file is large and can not reside in primary 
memory, it is stored on a secondary storage 


access concurrency, all buckets in a file need to 
be carefully allocated to a multiple disk system. 


In this paper we are concerned with the disk 
allocation problem for partial match retrieval. 
Any record type can be encoded as a binary 
string and binary files are probably the hardest 
type for which to design a good file structure. 
Also, it has been shown that Cartesian product 
files are effective for partial match queries. 
Therefore, we particularly concentrate on the 
allocation problem for binary Cartesian product 
files. 


The performance of the existing Disk Modulo allocation method is
first shown to be poor for binary Cartesian product files. A
heuristic allocation method for binary Cartesian product files is
then proposed for the case where the number of disks is a power
of 2, and it is extended to fit the cases where m is not a power
of 2. The proposed heuristic allocation method is shown to be
"near" strict optimal for a partial match query in which the
number of unspecified attributes is greater than a small number
(5 or 6). A systematic way to compute the response time to a
given query when the proposed heuristic and Disk Modulo
allocation methods are employed is also given.
Bucket Assigned Disk Bucket Assigned Disk


2] 2 

18] 3 

O] @ | 
2) 1 
of sé | 
18] Q | 
8) 1 
8] 2 | 


Disk # # of Buckets Assigned | 
to That Disk 
etalhee ee mes 
1 g 4 i 
: 1 4 5 
2 4 
3 4 


Table 5.1 (a) Table 5.2 (a) 


M=8 and K=8 
M=4 and K=8
# unspecified T T RE T RE 
# unspecified Ty, Tae RE Gai ee RE attributes SO HEU HEU DM DM 
attributes DM 
1 1 1.9206 198.20 1 100.08 
1 1 1.9000 198.90 1 120.28 2 1 1.2589 85.28 2 50.28 
2 1 1.4286 70.20 2 58.28 3 1 1.7321 57.73 3 33.33 
3 2 2.2143 9G. 32 3 66.67 4 2 2.6857 74.47 6 33.33 
4 4 4.0857 97.90 6 66.67 5 4 4.5357 88.19 19 49.00 
5 8 8 .gaa0 109.90 18 8B .28 6 8 8.2857 96.55 22 40.08 
6 16 16.2002 199.20 20 8D. 20 7 16 16.2290 108.22 35 45.71 
7 32 32.9000 120.20 36 88.89 8 32 32.9990 120.29 70 45.71 
8 64 64.9000 122 .22 72 88.89 
Table 5.2 (b) 
Table 5.1 (b) 
=8 and K=16 
M=4 and K=16 
# unspecified T T RE T RE 
# unspeci fied Teo THEU RE LEY To REpw attributes eo HEY HEU pM DM 
attributes 
1 1 1.0880 120.22 1 198.09 
1 1 1. 2228 128.22 1 198.99 2 i 1.2917 77.42 2 59.090 
2 l 1.4667 68.18 2 50.298 3 1 1.8214 54.98 3 33.33 
3 2 2.3000 86 .96 3 66.67 4 2 2.8791 69.47 6 33.33 
4 4 4.2308 94.55 6 66.67 5 4 4.8571 82.35 10 40.06 
5 8 8.1282 98.42 12 80.90 6 8 8.8369 99.53 20 40.00 
6 16 16.9699 99.56 20 80 . 22 7 16 16.7334 95.62 35 45.71 
7 32 32.0252 99.92 36 88.89 8 32 32.5952 98.17 78 45.71 
8 64 64.0056 99.99 72 88.89 9 64 64.4073 99.37 126 58.79 
9 128 128.220 102.028 136 94.12 19 128 128.2158 99.83 252 50.79 
12 256 256.2200 190.98 272 94.12 ll 256 256.0678 99.97 462 55.41 
11 512 512.2000 100.98 528 96.97 12 512 512.0206 129.20 924 55.4) 
12 1924 1924. 9090 122.08 1256 96 .97 13 1924 1924.6060 100.29 1716 59.67 
13 2848 2848 . 20960 128.28 2980 98.46 14 2948 2948. 9000 100.8 3432 59.67 
14 4996 4096 . BOBO 120.22 4160 98.46 15 4596 4996 . 0080 108.26 6436 63.64 
15 8192 8192.9000 100.28 8256 99.22 16 8192 8192 .G000 100.68 12872 63.64 
16 16384 16384.9090 190.88 16512 99 .22 
Table 5.4 
= = =9J mod,log m 
Table 5.3 (a) M=5, K=16 and Ps 2 Le A 
M=16 and K=8 # peda area Tso THEUL REVEUL To REom 
fn pe CLE AOe, “lag THEU REveu pM RE py 1 1 1.2980 190.98 1 100.90 
attributes 2 1 1.4667 68.18 2 59.96 
1 1 1.9090 129.22 4 129.90 ; ; viens aaa : ope 
2 1 1.1429 87.80 2 55 - 86 5 7 7.3333 95.45 19 18.20 
3 1 1.4286 78.80 3 33.33 6 13 13.7622 94.46 20 65.09 
4 1 1.9429 51.47 6 26267 7 26 26.4434 98.32 35 74.29 
5 2 2.9571 78 . 08 1 San 8 52 52.2962 99.43 78 74.29 
6 As 4.5714 Bree 22 29.0 9 193 123.4970 99.61 127 81.10 
7 8 8 . 8000 186.99 35 22.86 19 205 205.7622. 99.63 254 60.71 
8 16 16.9289 188.00 ie 22.86 11 418 419.3590 99.91 474 86.50 
12 829 820.1538 99.98 948 86.58 
13 1639 1639.29a9 99.99 1807 90.70 
14 3277 3277-4667 99.99 3614 90.68 
15 6554 6554.0000 128.88 6995 93.76 
iia S945) 16 13198 13198.9000 198.88 13999 93.70 
M=16 and K=16 Table 5.5 
# unspecified T., THEU RE VEU Tom RE om M=6, K=16 and P.=2) Mod log my 
attributes j 
1 i 1. G22 190.09 1 199.22 i pinnae Ts9 THEUL REVEuL Tom RE pw 
2 1 1.2998 83.33 2 59.22 
3 1 1.5786 63.35 3 See 1 1.9980 1.0989 190.20 1.0000 1260 
4 1 2.2648 44.15 6 Tee 2 1.2889 1.4667 68.18 2.8980 50 
5 2 3.4725 57.59 19 pen? 3 2 .GaaG 2.2080 99.98 3. G80 66 
6 4 5 +6833 78.38 29 ae 4 3. B80 3.99877 76.77 6 . BBO 59 
7 8 9.8084 81.56 35 22.8 5 6.2209 6.7328 89.14 18.9200 60 
8 16 17.8228 Soar! Hos Seas 6 11.9900 12.6643 86.86 20.9909 55 
9 32 33.6608 95.07 126 eee 7 22.0200 939566 92.61 35.9900 62 
18 64 65.3127 97.99 aoe sei 8 43.0000 46.9398 93.46 709.8000 61 
11 128 128.8132 99.37 aoe 27. 9 86.0200 89.5622 96.82 126.0800 68. 
12 256 256.3121 99.88 ao ae 10 171.0008 176.3846 96.95 252.0800 64: 
13 512 512.8800 ECG =001 -LET6 celeb 11 342.0008 348.5769 98.11 463.9000 73 
14 1924 1924.6096 199.08 3432 29.84 12 683.0008 692.2154 08.67 926.9800 73. 
15 Fane ta a a Se ee 13. 1366.9808 1377.4900 99.17 1730.8g00 78. 
16 4996 + =4096 . BBO 106.08 1 14 2731-0000 2746.3333 99.44 3460.9000 78 
15 5462.0000 5481.5900 99.64 6555 .2000 83.33 
16 18923.9098 19959.9000 99.75 13110.g90¢ 83.32 


Table 5.6 
=24 fmod log m! 


M=6, K=16 and P, 
# unspecified Tso THEU2 REL eu2 Tom RED 
attributes 
1 1.2290 1.2220 108.9 1.0289 196.00 
2 1.2820 1.5417 64.86 2. B22 59.28 
3 2.2822 2.3125 86.49 3.2000 66.67 
4 3.228 3.9148 76.63 6.000 59.98 
5 6.9000 6.6809 89.81 18.2092 69.80 
6 11.9009 12.3613 88.99 20.2828 55.00 
7 22.0000 23.0577 95.41 35.009 62.86 
8 43.2280 44.7051 96.19 78. B08B 61.43 
9 86 . B2BOB 87.3317 98.48 126.9920 68.25 
12 171.8288 172.9492 98.87 252.9200 67.86 
11 342.0000 343.6344 99.52 463.0990 73.87 
12 683.0000 685.4478 99.64 926.9200 73.76 
13 1366-2220 1368.3571 99.83 1730.9000 78.96 
14 2731.8008 2734.2917 99.88 3460.9000 78.93 
15 5462.0008 5465.3752 99.94 6555.9000 83.33 
16 19923 .2828 19927.0000 99.96 13119.8992 83.32 
Table 5.7 
; = 
M=7, Kel6 and p,=2! "Od log m 
po unepecs feed" (S60 TyEU2 REHEU2 TDM REDM 
attributes 
1 1.2090 1.998¢ 188.96 1.8000 190.00 
2 1.2920 1.2917 77.42 2.0000 59.08 
3 2 .BBBO 2.0714 96.55 3.9200 66.67 
4 3.20008 3.3764 88.85 6 .GG5 50.00 
5 5. BIBS 5.6891 87.89 19.9089 59.00 
6 19.9990 19.4472 95.72 28.8026 50.90 
7 19.2000 19.6892 96.58 35.9800 54.29 
8 37.2080 37.9499 97.52 70. B980 52.86 
9 74.0990 74.6866 99.19 126.9990 58.73 
19 147.9008 147.7451 99.58 252.0000 58.33 
il 293.0000 293.8720 99.78 462.0000 63.42 
12 586.0000 586.4203 99.93 924.9990 63.42 
13 1171.0000 1171.4821 99.96 1717.9000 68.20 
14 2341.0000 2341.4583 99.98 3434.95906 68.17 
15 4682 .2080 4682 .0000 120.06 6451 .9200 72.58 
16 9363.2000 9363.8000 120.80 12902 .0000 72.57 
Table 5.8 
SM Tso Tyev. RFyeui THev2 F¥yeva Tom = RE 
8 4 2.6599 2.8579 93.27 2.8579 93.87 3.8071 69.87 
8 5 2.5283 2.7187 92.98 2.7787 90.97 3.8046 66.24 
8 6 2.2259 2.6053 85.44 2.6967 82.54 3.8046 58.51 
a7 2.1295 2.6841 81.77 2.2938 92.84 3.8046 55.97 
8 8 1.5533 1.9975 77.76 1.9975 77.76 3.8846 49.83 
8 9 1.5587 1.9676 78.81 2.0409 76.80 3.8046 40.76 
8 12 1.5279 1.9619 77.88 2.0338 75.16 3.8846 40.16 
8 11 1.4365 1.9478 73.78 1.8898 76.05 3.8046 37.76 
8 12 1.4340 1.9321 74.22 1.9543 73.38 3.8246 37.69 
8 13 1.4137 1.9321 73.17 1.7659 80.96 3.8046 37.16 
814 1.4137 1.9321 73.17 1.7424 81.14 3.8046 37.16 
815 1.4111 1.9321 73.94 1.6421 85.94 3.8046 37.09 
816 1.1421 1.5431 74.91 1.5431 74.91 3.8046 30.02 
16 4 24.9876 25.1243 99.45 25.1243 99.45 28.3653 88.99 
16 5 28.4862 20.8778 98.12 21.1354 96.93 27.2781 75.12 
16 6 17.1542 18.5189 92.67 18.1231 94.65 27.1515 63.18 
16 7 14.9412 17.7862 84.00 15.4550 96.68 27.1451 55.94 
16 8 12.5225 13.2763 94.32 13.2763 94.32 27.1456 46.13 
16 9 11.6071 12.4597 93.16 12.9293 89.77 27.1450 42.76 
1618 16.4946 11.7945 88.98 12.2545 85.64 27.1456 38.66 
16 11 9.4351 11.4304 82.54 11.9999 85.01 27.1450 34.76 
16 12 8.8328 11.1344 79.33 19.5928 83.39 27.1450 32.54 
16 13 8.1901 11.9870 73.86 9.6664 83.88 27.1450. 29.84 
16 14 7.8044 11.9582 70.58 9.1928 85.74 27.1450 28.75 
16 15 7.4133 11.8512 67.08 8.4984 88.17 27.1450 27.31 
1616 6.3435 7.7227 82.14 7.7227 82.14 27.1450 23.37 
24 4 246.1701 249.2461 99.97 249.2461 99.97 258.6822 96.32 
24 5 199.8357 200.4949 99.67 201.8313 99.01 231.7746 86.22 
24 6 166.6132 179.4278 97.76 168.5526 98.85 225.9612 74.03 
24 7 143.8493 154.5536 92.56 144.0367 99.31 223.9369 63.88 
24 8 124.5875 125.4996 99.27 125.4996 99.27 223.8135 55.67 
24 9 111.2498 113-8549 98.49 114.9421 96.79 223.8050 49.71 
24.18 100.1676 193.3942 96.88 105.3645 95.97 223.8046 44.76 
2411 91.1142 96.5833 94.34 95.4793 95.43 223.8046 46.71 
2412 83.5570 91.7665 91.85 88.3628 94.56 223.8646 37.33 
2413 77.0474 89.2935 86.29 80.9989 95.12 223.8046 34.43 
2414 71.8581 87.7426 81.99 75.3481 95.37 223.8046 32.11 
24.15 67.1951 86.9325 77.38 69.8275 996.23 223.8046 30.02 
2416 62.3836 65.9619 95.76 65.9619 95.76 223.8046 27.84 
ALGORITHMS FOR REPLACE-ADD BASED PARACOMPUTERS


Clyde P. Kruskal 
Department of Computer Science 
University of Illinois 
Urbana, Illinois 61801 


I. Introduction 

Several groups are designing large-scale multiprocessors to take
advantage of inexpensive, fast floating-point processors which
will soon be available. One such project is the "NYU
Ultracomputer" [3], for which much effort has gone into designing
operating system algorithms [5], [10] and designing and
implementing numerical algorithms (e.g. [6]). In this paper we
present and analyze algorithms for solving nonnumerical problems
on an idealized model of the Ultracomputer, a
"replace-add-based paracomputer".


A replace-add-based paracomputer is essentially a traditional
shared memory machine augmented with an extra primitive, the
"replace-add". We show that this primitive enhances the model by
exhibiting algorithms that use the replace-add and are faster
than any algorithm for solving the same problem on a traditional
shared memory machine.


(N.B. The current Ultracomputer design is 
based on the "fetch-and-add" operation [4] rather 
than the replace-add. However, these two primi- 
tives are essentially equivalent, and all of our 
algorithms can be easily transferred to the newer 
model.) 


II. The Paracomputer Model of Computation

An idealized parallel processor, dubbed a para- 
computer by Schwartz [11] and classified as a WRAM 
by Borodin and Hopcroft [2], consists of auto- 
nomous processing elements (PEs) sharing a central 
memory. The model permits every PE to read or 
write a shared memory cell in one cycle. In par- 
ticular, simultaneous reads and writes directed at 
the same memory cell are effected in a single 
cycle. 


We augment the paracomputer model with the "replace-add"
operation (described below) and make precise the effect of
simultaneous access to the shared memory. To accomplish the
latter we define the serialization principle: The effect of
simultaneous actions by the PEs is as if the actions occurred in
some (unspecified) serial order. For example, consider the effect
of one load and two stores simultaneously directed at the same
memory cell: The cell will come to contain some one of the
quantities written into it. The load will return either the
original value or one of the stored values, possibly different
from the value the cell comes to contain. Note that simultaneous
memory updates are in fact accomplished in one cycle; the
serialization principle speaks only of the effect of simultaneous
actions and not of their implementation.

... Mathematical Sciences Program of the U.S. Depart- ...
76ERO3077, and in part by the National Science Foundation under
Grant Nos. NSF-MCS/79-21258 and NSF-MCS81-05896.


The Replace-Add Operation 


We now describe a simple yet very effective 
interprocessor synchronization operation, called 
replace-add, which takes two parameters C and E. 
This indivisible operation is defined to increment 
the value in cell C by the integer E and also 
return this sum to the executing PE. Moreover, 
replace-add must satisfy the serialization princi- 
ple stated above: If C is a shared cell and many 
replace-add operations simultaneously address C, 
the effect of these operations is exactly what it 
would be if they occurred in some (unspecified) 
serial order, i.e. C is modified by the appropri- 
ate total increment and each operation yields the 
intermediate value of C corresponding to its posi- 
tion in this order. 


The following example illustrates the semantics of replace-add:
Assume during some cycle $PE_i$ executes

    replace-add(C, E_i);

$PE_j$ executes

    replace-add(C, E_j);

and no other operations are performed on C. Furthermore, let V be
the value in C at the start of the cycle. Then, at the end of the
cycle, C will contain $V + E_i + E_j$ and, depending on the serial
order effected, either $PE_i$ and $PE_j$ will receive the values

    V + E_i   and   V + E_i + E_j

respectively, or they will receive the values

    V + E_i + E_j   and   V + E_j

respectively.
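
The two possible outcomes can be enumerated mechanically. The following is a small sequential model of the serialization principle, not a parallel implementation: `replace_add` and `outcomes` are hypothetical helper names, shared memory is modelled as a dictionary, and every serial order of the simultaneous operations is tried in turn.

```python
from itertools import permutations


def replace_add(memory, cell, e):
    """Add e to memory[cell] and return the new value (the sum)."""
    memory[cell] += e
    return memory[cell]


def outcomes(v, increments):
    """All (final value of C, values returned to each PE) pairs.

    increments[p] is the E used by PE p; each permutation stands
    for one serial order in which the replace-adds take effect.
    """
    results = set()
    for order in permutations(range(len(increments))):
        memory = {"C": v}
        returned = [None] * len(increments)
        for pe in order:
            returned[pe] = replace_add(memory, "C", increments[pe])
        results.add((memory["C"], tuple(returned)))
    return results


two_pes = outcomes(10, [3, 5])   # V = 10, E_i = 3, E_j = 5
```

With V = 10, $E_i$ = 3 and $E_j$ = 5 the cell always ends at 18, and the PEs receive either (13, 18) or (18, 15), matching the two cases described above.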


We stress that paracomputers, especially when augmented with the
replace-add, must be regarded as idealized computational models
since physical limitations such as restricted fan-in prevent
their realization. However, the "Ultracomputer group" at the
Courant Institute of New York University is designing a parallel
processor that approximates such a machine (see [3] for a
description of the architecture). A crucial aspect of the design
is that multiple accesses to the same location (including
replace-adds) are accomplished in approximately the same time as
a single access to a location.


III. Algorithms


This section contains paracomputer algorithms for solving a wide
class of problems. We consistently use N for the problem size,
use P for the number of processors, and denote the PEs as
$PE_0, \dots, PE_{P-1}$. Some of the algorithms assume, for the
sake of clarity, that P divides N (i.e. N = LP); they are easily
generalized for P not dividing N. We use the order notations
$O$, $\Omega$, $\Theta$, and $o$ as defined by Knuth [7]. The
base of all logs can be assumed to be two unless otherwise
specified. As in [11], we say that an algorithm is completely
parallelizable if its speedup is $\Omega(P)$.


Since many algorithms have synchronization points (i.e. points
that all PEs must reach before any PE passes), it is important to
note the following constant-time algorithm for synchronization:
Let C be an otherwise unused shared cell with initial value 0;
each PE replace-adds 1 to C and waits until C has value P; the
PEs are then synchronized and may continue.


Often a program will require many successive synchronizations.
This can be achieved by having three synchronizing cells and
rotating the synchronizations: Let $C_1$, $C_2$, $C_3$ be three
(otherwise unused) shared cells with the values of $C_1$ and
$C_2$ initially 0. For the first synchronization point, each PE
replace-adds 1 to $C_1$ and waits until $C_1$ has value P; the
PEs are then synchronized. Some one of the PEs sets $C_3$ to 0.
For the next synchronization $C_2$ is used for replace-adding and
when its value reaches P, $C_1$ is set to 0. Note that $C_3$ is
set to 0 before $C_2$ reaches P, so for the third synchronization
$C_3$ may be used, and then $C_2$ set to 0. The initial state
having been reestablished, we may again use $C_1$ and set $C_3$
to 0, etc.
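
The rotation of three cells can be tried out with ordinary threads. The sketch below is an approximation under stated assumptions, not a paracomputer: a lock-protected counter (`SharedCell`, a hypothetical name) stands in for the indivisible replace-add, waiting is a busy loop, and the PE that receives the value P from its replace-add is the one chosen to clear the cell needed two synchronizations later (the text only requires that some one PE does so).

```python
import threading
import time


class SharedCell:
    """A shared cell whose replace_add is made indivisible by a lock."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def replace_add(self, e):
        with self._lock:
            self.value += e
            return self.value


def run_rounds(P, rounds):
    """Run P threads through `rounds` synchronization points.

    Returns the round numbers in the order the threads logged them.
    """
    cells = [SharedCell(), SharedCell(), SharedCell()]
    log, log_lock = [], threading.Lock()

    def pe(_index):
        for r in range(rounds):
            with log_lock:
                log.append(r)                  # the "work" of round r
            cur = cells[r % 3]
            if cur.replace_add(1) == P:        # last PE to arrive
                cells[(r + 2) % 3].value = 0   # clear cell for round r + 2
            while cur.value != P:              # wait for all P arrivals
                time.sleep(0)

    threads = [threading.Thread(target=pe, args=(i,)) for i in range(P)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because no thread can start round r + 1 before every thread has logged round r, the returned log is non-decreasing, which is the property a synchronization point is meant to provide.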
Summing 

Suppose that we are given an array $W = w_0, \dots, w_{N-1}$ of N
values, and wish to compute the partial sums
$s_i = w_0 + \dots + w_i$ for $i = 0, \dots, N-1$. This problem
can be solved in time $\Theta(N/P + \log P)$ using standard
algorithms for solving linear recurrences (e.g. see [11]). Thus
summing is completely parallelizable for
$N = \Omega(P \log P)$. Of course, the summing algorithm may be
generalized by substituting any associative binary operation for
addition. Note that if only $s_{N-1}$ (the total "sum") is
desired, more efficient algorithms exist for certain binary
operations. For example, the maximum of N values can be
determined in time $O(N/P + \log\log P)$ (see [12], [13]).
Integer summing. When finding the partial sums of N integers, we
can make heavy use of the replace-add to solve the problem in
time $\Theta(N/P + \log\log P)$ by adapting Valiant's algorithm
for finding the maximum [13].

First consider the case when P = N(N-1)/2. This problem is easily
solved in constant time: Assign the first PE the task of finding
$s_1$ (the second partial sum), the next two PEs the task of
finding $s_2$, the next three PEs the task of finding $s_3$, etc.
The partial sums $s_i$ can be independently determined in
constant time by initially setting $s_i = w_0$ and then
replace-adding $w_1, \dots, w_i$ to $s_i$.
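
The assignment of PEs to partial sums in this constant-time case can be made concrete with a sequential model. In the sketch below each iteration of the inner loop stands for one PE performing one replace-add; on a paracomputer all of them would happen in the same cycle. The function name `triangular_partial_sums` is hypothetical.

```python
def triangular_partial_sums(w):
    """Partial sums of w using one simulated PE per (i, j) pair, j <= i.

    Returns the partial sums and the number of PEs that were used.
    """
    n = len(w)
    s = [w[0]] * n                  # every s_i is initialized to w_0
    pes_used = 0
    for i in range(1, n):
        for j in range(1, i + 1):   # the PE for (i, j) replace-adds w_j to s_i
            s[i] += w[j]
            pes_used += 1
    return s, pes_used


sums, pes = triangular_partial_sums([3, 1, 4, 1, 5])
```

For five items this uses 1 + 2 + 3 + 4 = 10 simulated PEs, i.e. N(N-1)/2, and yields the partial sums 3, 4, 8, 9, 14.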


Next assume merely $P \ge N$; this problem can be solved with the
following algorithm:

If N=1 set $s_0 = w_0$ and return. Otherwise perform the
following five steps.

(1) Partition the N items into $g = \lceil N^2/(2P+N) \rceil$
    groups $G_0, \dots, G_{g-1}$ each of size $h = (2P+N)/N$, so
    that the first h items are in group $G_0$, the next h items
    in $G_1$, etc.

(2) Partition the PEs also into g groups with
    $h(h-1)/2 = P(2P+N)/N^2$ PEs in each.

(3) Assign each group of PEs to a distinct group of items, and
    solve the summing problem for each group independently using
    the preceding dependent-size integer summing algorithm.

(4) Apply this algorithm recursively to the total sums $t_i$ of
    each group $G_i$, thereby producing $u_0, \dots, u_{g-1}$,
    the partial sums of the $t_i$'s.

(5) Add $u_{i-1}$ (or 0 if i = 0) to each partial sum in $G_i$.

Steps (1), (2), (3), and (5) each require constant time and,
since P < N(N-1)/2, the depth of the recursion at step (4) is
$O(\log\log N - \log\log(P/N+1))$ (see Valiant [13]), so the
entire algorithm requires time
$O(\log\log N - \log\log(P/N+1))$. In particular, the saturated
problem (i.e. N=P) is solvable in time $\Theta(\log\log P)$.

Finally, consider the case when P<N, and use the following
algorithm:

(1) Partition the items into P groups $G_0, \dots, G_{P-1}$ each
    of size N/P, so that the first N/P items are in group $G_0$,
    the next N/P items in $G_1$, etc.

(2) Apply the sequential summing algorithm to each group
    independently.

(3) Apply the preceding saturated integer summing algorithm to
    the total sums $t_i$ of each group $G_i$, thereby producing
    $u_0, \dots, u_{P-1}$, the partial sums of the $t_i$'s.

(4) Add $u_{i-1}$ (or 0 if i=0) to each partial sum in $G_i$.

Step (1) requires constant time, steps (2) and (4) require
O(N/P) time, and step (3) requires $\Theta(\log\log P)$ time.
Thus, the entire algorithm requires $O(N/P + \log\log P)$ time
and is completely parallelizable for
$N = \Omega(P \log\log P)$.
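
The four steps for P<N can be followed in a sequential model. In the sketch below, which assumes P divides N, each group is handled by one simulated PE, and step (3) is replaced by an ordinary sequential prefix sum rather than the saturated replace-add algorithm, so only the data flow is illustrated and not the running time. The function name `blocked_partial_sums` is hypothetical.

```python
from itertools import accumulate


def blocked_partial_sums(w, P):
    """Partial sums of w with P simulated PEs; assumes P divides len(w)."""
    size = len(w) // P
    # Steps (1)-(2): PE i sums its own group G_i sequentially.
    groups = [list(accumulate(w[i * size:(i + 1) * size]))
              for i in range(P)]
    # Step (3): partial sums u_i of the group totals t_i
    # (sequential stand-in for the saturated summing algorithm).
    u = list(accumulate(g[-1] for g in groups))
    # Step (4): add u_{i-1} (0 for i = 0) to every partial sum in G_i.
    result = []
    for i, g in enumerate(groups):
        offset = u[i - 1] if i > 0 else 0
        result.extend(x + offset for x in g)
    return result


example = blocked_partial_sums([3, 1, 4, 1, 5, 9], 3)
```

With three PEs the groups are (3,1), (4,1), (5,9), the group totals are 4, 5, 14, and the final partial sums are 3, 4, 8, 9, 14, 23.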


Unordered integer summing. The unordered summing problem is the
problem of finding the partial sums of some one unspecified
permutation of the data. If the $w_i$ are integers this sum can
be formed in time O(N/P): initialize some temporary location T to
0 and replace-add every $w_i$ to T. The result of the addition of
$w_i$ is the partial sum $s_i$.
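
A sequential model makes the "unspecified permutation" explicit. In the sketch below a seeded shuffle stands in for whatever serial order the machine happens to effect; the function name `unordered_partial_sums` and the `seed` parameter are hypothetical.

```python
import random


def unordered_partial_sums(w, seed=0):
    """Replace-add every w[idx] to T in an arbitrary serial order.

    Returns the order used and, for each item, the value its
    replace-add handed back.
    """
    order = list(range(len(w)))
    random.Random(seed).shuffle(order)   # stand-in for the serial order
    T = 0
    returned = [None] * len(w)
    for idx in order:
        T += w[idx]                      # replace-add(T, w[idx])
        returned[idx] = T                # value given back to that PE
    return order, returned
```

Read in the order chosen, the returned values are exactly the partial sums of the permuted data, and the last one is the total.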


Permutations 


Suppose we are given an array $W = w_0, \dots, w_{N-1}$ of size
N = PL and a permutation $\pi$ of 0,...,N-1. Then the permutation
problem is to permute W according to $\pi$.

Algorithm. One algorithm for solving this problem allocates a
temporary array T of size N and performs the following two steps:

(1) Copy W directly into T (i.e., each $PE_i$ moves $w_{iL+j}$
    into $t_{iL+j}$ for $0 \le j < L$).

(2) Copy T back into W according to $\pi$ (i.e., each $PE_i$
    moves $t_{iL+j}$ into $w_{\pi(iL+j)}$ for $0 \le j < L$).

Analysis. Steps (1) and (2) both require time O(N/P). Thus the
entire algorithm requires time O(N/P) and is completely
parallelizable for $N = \Omega(P)$.
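
The two copying steps translate directly into code. The sketch below is a sequential model in which the outer loop over i plays the role of the P PEs; it assumes P divides N and that `pi[x]` gives the destination of the item originally at position x. The function name `permute` is hypothetical.

```python
def permute(w, pi, P):
    """Move the item at position x to position pi[x], in place."""
    n = len(w)
    L = n // P
    t = [None] * n
    for i in range(P):            # step (1): PE_i copies its L items into T
        for j in range(L):
            t[i * L + j] = w[i * L + j]
    for i in range(P):            # step (2): PE_i scatters them back via pi
        for j in range(L):
            w[pi[i * L + j]] = t[i * L + j]
    return w


example = permute(list("abcdef"), [2, 0, 1, 5, 3, 4], 3)
```

The temporary array is what makes step (2) safe: every item is read from T, so no PE can overwrite a value of W that another PE still needs.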
Variant. Unfortunately, the above algorithm requires extra
storage proportional to N. When $\pi$ is known in advance, i.e.
$\pi$ is not part of the data, the problem is solvable in time
$\Theta(N/P)$ using extra storage proportional only to P:
Partition W into R and S where |R|=P and |S|=N-P. Copy R into a
temporary array R' (thus W = R' (disjoint) UNION S). Store into R
(from R' and S) the items in $\pi^{-1}(R)$. (Note that $\pi$ and
hence $\pi^{-1}$ are known in advance.) Store the items of R'
that have not been placed back into R into the free locations of
S. Now the problem has been reduced, in constant time, from W of
size N to S of size N-P. N/P such iterations will effect the
entire permutation.


Packing


Suppose we are given an array W = w_0, ..., w_{N-1}
of N = PL items, some of which are marked. The
packing problem is to move the i-th marked item to
the i-th location of W.


Algorithm.

(1) Use integer summing (with marked items
    assigned 1 and unmarked items assigned 0) to
    determine the desired destinations of the
    marked items.

(2) Partition the array W into L blocks of P con-
    tiguous items. Perform the following two
    steps for k = 0,...,L-1.

    (a) Each PE_i stores the i-th item of the k-
        th block into a (distinct) temporary
        location t_i.

    (b) Each PE_i whose associated item t_i is
        marked moves the item from t_i into its
        desired destination in W.

Analysis. Step (1) is integer summing and thus
requires time O(N/P + loglogP), and step (2) con-
sists of N/P iterations of two O(1) operations and
thus requires time O(N/P). Therefore, the entire
algorithm requires time O(N/P + loglogP) and is
completely parallelizable for N = Ω(P loglogP).
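The packing steps can be sketched as below. This is a sequential stand-in, not the parallel algorithm itself: the name `pack` is hypothetical, the integer summing of step (1) is done with an ordinary scan, and each pass of the inner loops plays the role of one PE.

```python
from itertools import accumulate

def pack(w, marked, p):
    """Move the i-th marked item of w to the i-th location, block by block."""
    n = len(w)
    # (1) summing the 0/1 marks gives each marked item its 1-based rank
    dest = list(accumulate(1 if m else 0 for m in marked))
    # (2) process the blocks of P contiguous items in order
    for start in range(0, n, p):
        block = range(start, min(start + p, n))
        t = [w[i] for i in block]              # (a) copy block into temporaries
        for slot, i in enumerate(block):       # (b) marked items move to W
            if marked[i]:
                w[dest[i] - 1] = t[slot]
    return w
```

A destination never lies to the right of the item's own position, so a write in step (b) can only touch an earlier block or the current block, whose items are already held in the temporaries. For example, packing `a b c d e f` with `b`, `d`, `e` marked leaves `b d e` in the first three locations.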


Variants. The unordered packing problem is the 
same as the packing problem, except that it is 
unnecessary for the marked items to maintain their 
original relative order. This problem can be 
solved in time O(N/P) by replacing summing with 
unordered summing in step (1) above. Thus the
unordered packing problem is completely parallel-
izable for N = Ω(P).


Unfortunately, this algorithm requires extra 
storage proportional to N. Unordered packing, 
however, can be realized in time O(N/P) using 
extra storage proportional only to P: delete step 
(1) and begin step (2) with a replace-add to 
determine, at the k-th iteration of step (2), the 
desired destination of the items in the k-th 
block. 


Merging and Sorting 


In [8] we show that two lists of size m,n,
where m ≤ n and N = m+n, can be merged in time
O(N/P + loglogm). Thus, when m=n merging is
completely parallelizable for N = Ω(P loglogP).


We also show in [8] how this merging algorithm
can be used to obtain a sorting algorithm which
requires time

    O(log N log log N / log log log N)   for N = O(P log log P)

and

    O((N log N)/P)   for N = Ω(P loglogP).

Thus, sorting is completely parallelizable for
N = Ω(P loglogP).


An important special case. Suppose we wish to 
sort an array W consisting of N (not necessarily 
distinct) integers in the range 1 to N. The fol- 
lowing algorithm solves this simpler problem in 
time O(N/P + loglogP). 


(1) Create an array C of size N initialized to 0. 

(2) Count how many items have each value by
    incrementing (via replace-add) C(w_i) for all
    i, 1 ≤ i ≤ N.

(3) Apply integer summing to C and then set 


D(i) = C(i-1) (and D(1) = 0), so that D(i) is 
the number of items less than i. 


(4) Copy W into a temporary array T. 


(5) The final location j of the i-th original
    item is obtained as replace-add(D(t_i),1).
    Set w_j equal to t_i.


To illustrate this algorithm consider the prob-
lem of sorting the array W = (2,1,5,3,2). After
step (2) above C = (1,2,1,0,1), where C(i) is the
number of items with value i (e.g. two items have
value 2 and no items have value 4). At step (3)
summing is applied to transform C into
(1,3,4,4,5); C(i) now represents the number of
items less than or equal to i. D = (0,1,3,4,4) is
derived from C by shifting the values of C right
one position and represents the number of items
less than i. At step (4) W is copied into T.
Finally at step (5) the final location of the i-th
item of W is determined by replace-adding 1 to
D(t_i). For example, the fourth item of T is 3 so
its final destination in W is D(3)+1 = 4. More
interestingly, since the first and fifth items of
T are both 2, they both replace-add 1 to D(2) to
determine their final destinations; one of them
effects the replace-add first and its final desti-
nation in W is D(2)+1 = 2, and the other effects
the replace-add second and its final destination
in W is D(2)+1+1 = 3.


Steps (1), (2), (4), and (5) all require time
O(N/P) and step (3) requires time
O(N/P + loglogP). Therefore the entire algorithm
requires time O(N/P + loglogP) and its speedup is
O(N/(N/P + loglogP)). It is completely parallel-
izable for N = Ω(P log log P).
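The five steps of this special case translate directly into a short sequential sketch. The function name `sort_small_integers` is an assumption, and the replace-add of step (5) is imitated by an increment followed by a read of the new value.

```python
def sort_small_integers(w):
    """Sort a list of N integers in the range 1..N, following steps (1)-(5)."""
    n = len(w)
    c = [0] * (n + 1)                 # (1) C(1..N); index 0 is unused
    for x in w:                       # (2) count items of each value
        c[x] += 1
    for v in range(1, n + 1):         # (3) summing: C(v) = number of items <= v
        c[v] += c[v - 1]
    d = [0] * (n + 1)
    for v in range(1, n + 1):         #     D(v) = number of items < v
        d[v] = c[v - 1]
    t = list(w)                       # (4) copy W into T
    for x in t:                       # (5) replace-add(D(t_i), 1) gives location j
        d[x] += 1
        w[d[x] - 1] = x               #     locations are 1-based in the text
    return w
```

Running it on W = (2,1,5,3,2) reproduces the worked example: the two items of value 2 land in locations 2 and 3, and the item of value 3 lands in location 4.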


An alternate algorithm with good average-case
behavior. We now describe a parallel version of
quicksort and show that its average-case time com-
plexity is O((NlogN)/P). Thus, using average-
case analyses, comparison-exchange sorting is com-
pletely parallelizable for N = Ω(P).


Suppose we wish to sort an array W of N items. 
First consider the case when N=P. 


If N ≤ 1 then W is sorted. Otherwise perform
the following steps.


(1) Choose an item M at random from W. 


(2) Let S, E, B be the sets of items smaller than 
M, equal to M, and bigger than M, respec- 
tively. Apply unordered packing three times: 
first to pack the items of S to the beginning 
of W, then to pack the items of E immediately 
after, and finally to pack the items of B to 
the end of W. 


(3) Assign |S| PEs to S and |B| PEs to B, and
    recursively apply the algorithm to S and B
    concurrently.


We now analyze this algorithm under the assump-
tion that the items are all distinct, which cannot
decrease the (average) execution time. Suppose
that the item M chosen during step (1) is the i-th
smallest item in W. Then the algorithm is recur-
sively applied to sets of size i-1 and N-i. Since
i is uniformly distributed over {1,...,N}, we are
essentially constructing a random binary search
tree of size N. The expected depth of the recur-
sion is the expected height of this tree, which is
known to be O(log N) (see Robson [79]). Since
only a constant amount of time is required for
steps (1) and (2) (see section on packing), the
entire algorithm requires time O(log N); since
N=P, the speedup is O(P logP)/O(logP) = O(P).


For N>P, we employ the above algorithm as if
we had N PEs by assigning each PE the work per-
formed by N/P PEs. This gives a time complexity
of O(N/P) · O(logN) = O((NlogN)/P) and a speedup of
O(P).
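The recursive structure of this quicksort can be sketched sequentially. The sketch assumes a hypothetical function name `quicksort_by_packing`; the three list comprehensions stand in for the three applications of unordered packing, and the two recursive calls, which the parallel algorithm runs concurrently on disjoint sets of PEs, are simply made one after the other.

```python
import random

def quicksort_by_packing(w, rng=None):
    """Sort w by random pivoting and three-way partitioning into S, E, B."""
    rng = rng or random.Random(0)
    if len(w) <= 1:
        return list(w)
    m = rng.choice(w)                    # (1) choose M at random
    s = [x for x in w if x < m]          # (2) pack the smaller items,
    e = [x for x in w if x == m]         #     then the equal items,
    b = [x for x in w if x > m]          #     then the bigger items
    # (3) recurse on S and B (concurrently in the parallel version)
    return quicksort_by_packing(s, rng) + e + quicksort_by_packing(b, rng)
```

Since E always contains the chosen item, both S and B are strictly smaller than W, which is why the recursion depth corresponds to the height of the random binary search tree discussed in the analysis.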


As a practical consideration, choosing M likely 
to be near the true median lowers the average-case 
time complexity of the algorithm (but not its 
order). One possibility is to use the median of a 
random sample of some R<N items. If we choose
R = O(√P) the median can be found in only constant
time by sorting.


Selection 


Suppose we are given an array W of N items from
an ordered set and an integer 1 ≤ k ≤ N, and wish
to find the k-th smallest item in the array. For
N ≤ P we know of no algorithm faster than sorting.
However, for N>P we can parallelize the linear
sequential algorithm of Blum et al. [1] as fol-
lows.

Algorithm. If N ≤ P sort the items; the k-th
smallest item is the k-th item in W. If N>P per-
form the following four steps:


(1) Partition the items into P groups of size 
(essentially) N/P. Assign the i-th PE to the 
i-th group and use the sequential fast median 
algorithm to find the median item in each 
group. 


(2) Sort these medians to find M, the median of 
the local medians. 


(3) Let S, E, and B be the sets of items smaller
    than M, equal to M, and bigger than M,
    respectively. Use unordered summing to
    determine |S| and |E| (the cardinalities of
    the sets S and E).


(4) Perform one of the following three steps:

    (a) k ≤ |S|: Pack S using unordered packing
        and then recursively apply this
        (generalized-median finding) algorithm
        to S, still searching for the k-th smal-
        lest item.

    (b) |S| < k ≤ |S|+|E|: The k-th smallest
        item is M.

    (c) |S|+|E| < k: Pack B using unordered
        packing and then recursively apply this
        (generalized-median finding) algorithm
        to B, but now searching for the
        (k - |S| - |E|)-th smallest item.


Analysis. The important property of this algo-
rithm is that at each recursive application at
least a quarter of the remaining items are elim-
inated from consideration. After log_{4/3}(N/P)
recursions, the number of items remaining is no
more than

    N (3/4)^{log_{4/3}(N/P)} = P,

at which point we apply the sorting algorithm.
The complexity of step (1) is O(N/P), of step (2)
is O(log P log log P / log log log P), of step (3)
is O(N/P), and of step (4) is O(N/P + T_P(3N/4))
where T_P(N) is the complexity of the entire
algorithm. Thus, the complexity T_P(N) is

    T_P(N) = O(log P log log P / log log log P)   if N ≤ P

and

    T_P(N) = O(N/P + log P log log P / log log log P
               + N/P + T_P(3N/4))                 if N > P

           = O(N/P + (log_{4/3}(N/P) + 1)
               · log P log log P / log log log P).

Hence the algorithm is completely parallelizable
for N = Ω(P log P (log log P)^2 / log log log P).
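The control flow of the selection algorithm can be illustrated with the sequential sketch below. It rests on several assumptions: the name `select` is hypothetical, `sorted` is used as a stand-in both for the sequential fast median algorithm of step (1) and for the parallel sort of step (2), and list comprehensions replace unordered summing and packing.

```python
def select(w, k, p):
    """Return the k-th smallest item of w (k is 1-based) using p simulated PEs."""
    w = list(w)
    while len(w) > p:
        size = -(-len(w) // p)                       # about N/P items per group
        groups = [w[i:i + size] for i in range(0, len(w), size)]
        # (1) median of each group (sorted() stands in for fast median)
        medians = [sorted(g)[(len(g) - 1) // 2] for g in groups]
        # (2) M, the median of the local medians
        m = sorted(medians)[(len(medians) - 1) // 2]
        # (3) sizes of S and E
        s = [x for x in w if x < m]
        e_count = sum(1 for x in w if x == m)
        # (4) recurse into S, stop at M, or recurse into B
        if k <= len(s):
            w = s
        elif k <= len(s) + e_count:
            return m
        else:
            k -= len(s) + e_count
            w = [x for x in w if x > m]
    return sorted(w)[k - 1]                          # N <= P: sort and pick
```

Each pass through the loop either returns M or continues with a set that excludes M, so the number of remaining items strictly decreases until at most P are left.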


IV. Conclusion 
We have presented algorithms for solving
several basic problems on a replace-add-based
paracomputer. All of the problems discussed are
completely parallelizable, at least for large
enough problems. In fact, for none of the prob-
lems does the problem size have to be signifi-
cantly larger than the number of PEs in order to
attain maximal speedup.


While a paracomputer not enhanced with the
replace-add will also attain maximal speedup for
solving the above problems, such a machine some-
times requires slightly larger problems to attain
this goal. What is perhaps more significant is
that a machine without the replace-add is more
difficult to program and sometimes requires exor-
bitant overhead in order to allocate the PEs to
their tasks. This manifests itself in the case of
merging and therefore sorting: The Borodin-
Hopcroft [2] technique for solving the PE alloca-
tion problem on the weaker model is not only unob-
vious but requires at each iteration several steps
to reallocate the PEs. In contrast, on a
replace-add-based machine it is extremely easy to
solve the PE allocation problem and the resulting
algorithm has low overhead.


In summary, the performance of replace-add-based
paracomputers in solving the above problems is,
in our opinion, quite impressive. This, together
with their ability (as noted earlier) to realize
highly concurrent operating system primitives,
makes replace-add-based paracomputers an
architecture worth striving for. While fan-in and
other limitations prevent their physical
realization, they can be reasonably approximated
by machines using a multistage interconnection
network [3]. The "Ultracomputer group" at the
Courant Institute of New York University is
presently designing a prototype of such a machine
and believes that a full scale version containing
thousands of PEs will be constructible by the end
of the decade.


CONSTRUCTING PARALLEL PROGRAMS AND THEIR TERMINATION PROOF


J.P. BANATRE, M. BANATRE, P. QUINTON 
I.R.1I.S.A. 
Campus de Beaulieu 
35042 RENNES Cédex - France 


Abstract -- This note considers the construc-
tion of parallel programs and the production of
their termination proof. In the first part, an ori- 
ginal scheme for describing process cooperation is 
presented and it is shown how this scheme may be 
used for the production of termination proofs. Ba- 
sically, the approach consists of mapping a system 
of processes into a multiset which is repeatedly 
decreased throughout the computation. Using pro- 
perties of a well-founded ordering on finite multi- 
sets, we derive termination proofs. In the second 
part, an example illustrates the method and other 
applications are suggested. 


1. Introduction 


The subject of construction of parallel (or
distributed) programs gains more and more interest,
as microprocessor technology is going ahead.


Several languages have been proposed which
allow the description of parallel programs, for
example [3]. Some of the theoretical aspects in-
volved in the semantics of these programs have been
investigated by several groups. Another field of
interest concerns the proof of strong correctness
for distributed programs. Recent progress is re-
ported in [4]. It appears from the above enume-
ration that several kinds of investigations are
going on "collaterally", but the problem of cons-
tructing parallel programs which surely terminate
is never addressed globally. This is the topic of
the present study.


Concerning the termination of his repetitive
construct, Dijkstra states in [1], p. 41 : "The
basic theorem for the repetitive construct asserts
for a condition P that is kept invariantly true that

(P and wp(DO,T)) => wp(DO, P and non BB)

Here the term wp(DO,T) is the weakest precondition
such that the repetitive construct will terminate.
Given an arbitrary construct DO it is in general
very hard -if not impossible- to determine
wp(DO,T). I therefore suggest to design our repeti-
tive constructs with the requirement of termination
consciously in mind, i.e., to choose an appropriate
proof for termination and to make the program in
such a way that it satisfies the assumptions of the
proof". Then he proposes to map the DO construct
variables into a well-founded set, chosen to be the
natural numbers under the > ordering. This idea
provides a straightforward method for proving loop
termination.


Our proposal applies the same type of idea to
the construction of parallel programs. Constructs
that we propose for expressing parallel programs
are designed with the requirement of termination
clearly in mind.


0190-3918/82/0000/0224$00.75 © 1982 IEEE


An original scheme for describing process coo- 
peration is presented in section 2 and section 3 
shows how this scheme may be used for termination 
proofs. Application of these tools in the program- 
ming of an example is demonstrated in section 4. 
Section 5 contains a brief review and discussion. 


2. A scheme for process cooperation 


Consider a system S of active processes
p_1, ..., p_n. Each p_i is provided with a "weight" w_i.
Cooperation between any couple (p_i,p_j) of S is go-
verned by a condition R(w_i,w_j) which has to be met
before any communication between p_i and p_j occurs.
Processes p_i and p_j are said to be neighbours when
their weights verify condition R, otherwise they
are "isolated". This neighbourhood relationship is
dynamic since after cooperation, processes p_i and
p_j may change their respective weights in such a
way that R(w_i,w_j) does not hold anymore (but
R(w_i,w_k) and R(w_j,w_l) may hold ...).


The overall system S becomes "steady" when all
its component processes become isolated.


The following program fragment gives an in-
formal description of the functioning of process
p_i :


p_i : weight (w_i) exchange δ_i with δ_j
begin
   do
      wait(#coupling with a process p_j or
           steady state#) ;
      if # steady state # then # exit do loop #
                          else β_i(w_i,δ_j)


Fig. 1


Process p_i possesses a weight w_i and when
coupled with a process p_j "receives" information
δ_j from p_j and "sends" information δ_i to p_j. Only
after this information exchange, processing
β_i(w_i,δ_j) takes place. If system S becomes steady,
process p_i executes its postlude γ_i and terminates.


3. Termination proof 


A usual tool for proving the termination of
programs is the well-founded set : a set of elements
and an ordering > defined on these elements, such
that there can be no infinite descending sequen-
ces of elements. The idea for proving termination
of a process is to find a termination function that
maps process variables into a well-founded set, the
value of the termination function being successi-
vely decreased throughout the computation. Natural
numbers under the > ordering are often used for
proving termination of loops [1]. In [2], multiset
ordering is shown to be well-founded and is used
for proving termination of production systems. Mul-
tisets are like sets, but may contain multiple
occurrences of identical elements. Consider two
multisets of natural numbers M_1 and M_2; the rela-
tionship M_1 >> M_2 holds if M_2 can be obtained from
M_1 by replacing one or more elements of M_1 by any
finite sequence of natural numbers, each of which
is smaller than the replaced one (more details
in [2]).
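The multiset ordering on natural numbers can be tested mechanically. The sketch below uses a standard equivalent formulation for totally ordered elements, stated here as an assumption rather than taken from the text: M1 >> M2 holds when the two multisets differ and every element that M2 gained is smaller than some element that M1 lost. The function name `multiset_greater` is hypothetical.

```python
from collections import Counter

def multiset_greater(m1, m2):
    """True if m1 >> m2 in the multiset ordering over natural numbers."""
    a, b = Counter(m1), Counter(m2)
    removed = a - b      # elements of M1 that were replaced
    added = b - a        # elements that replaced them
    if not removed:
        return False     # nothing was replaced: M2 equals or extends M1
    top = max(removed)
    # every new element must be smaller than some replaced element
    return all(y < top for y in added)
```

For example, {5,5,5} >> {5,4,4,4} because one 5 was replaced by three smaller numbers, while {5,4} is not greater than {5,5}.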


Our idea consists of applying the multiset or-
dering for proving termination of our parallel pro-
grams. To each process p_i is associated a termi-
nation function f_i (corresponding roughly to β_i of
fig.1) which maps the weights into the set of na-
tural numbers under the usual ordering. Each appli-
cation of the function reduces the weight through
the computation. So w_i will take successively the
following values {w_i1, w_i2, ..., w_ik, ...} such that,
∀u,v: u > v ⇒ w_iv > w_iu. Assume now that every pro-
cess p_i is provided with such a function; then the
initial state of the global processing (involving
p_1, ..., p_n) may be described by the multiset
{w_11, w_12, ..., w_1n} = W_1, any subsequent state
{w_i1, w_i2, ..., w_in} = W_i will be such that
W_i << W_1, and any state W_j derived from W_i will be
such that W_j << W_i. Thus we have a simple means for
proving termination of parallel programs built
according to our scheme.


4. A short example 


4.1. The problem and its solution


Consider the problem of sorting a set S of n
(different) integers in ascending order. The
following algorithm may be imagined :

A process is associated to each number and
initially a weight n is attached to each process.
So there are n processes and W_1 = {n, ..., n}
(n elements).

The condition R between two processes p_i and
p_j is defined as R(w_i,w_j) = (w_i - w_j = 0), i.e., two
processes may cooperate iff their respective
weights are identical.


Consider two processes p_i and p_j such that
R(w_i,w_j) is true. The actual processing performed
by p_i consists in comparing value v_i (number to
which p_i is associated) with v_j. If v_i < v_j then de-
crease w_i by one, otherwise w_i remains unchanged.
p_j performs the symmetric processing.


The function f_i attached to p_i is the
following :


function f_i = if v_i < v_j then w_i := w_i - 1 fi

This function possesses the property required
from termination functions; proof of termination of
our algorithm is then straightforward.




4.2. Functioning of the algorithm 


Let S be {7,4,2,3,1}, and each process p_i be
represented by a couple (v_i,w_i) where v_i represents
the value attached to p_i and w_i the weight of p_i.


A possible processing leading to the solution 
is the following : 


(7,5) (4,5) (2,5) (3,5) (1,5) 
(7,5) (4,4) (2,4) (3,5) (1,5) 
(7,5) (4,4) (2,3) (3,5) (1,4) 
(7,5) (4,4) (2,3) (3,4) (1,3) 
(7,5) (4,4) (2,3) (3,3) (1,2) 
(7,5) (4,4) (2,2) (3,3) (1,2) 
(7,5) (4,4) (2,2) (3,3) (1,1) 


Two communicating processes are linked by a
horizontal line.


Of course this is one among the possible paths
leading to the solution. This algorithm is non-
deterministic as it does not indicate how coope-
rating couples are selected.


When the system reaches its steady state,
weight w_i of process p_i represents the position of
v_i in the sorted sequence; w_i may then be printed
together with v_i by the γ_i part of process p_i.
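The behaviour of this example is easy to simulate. The sketch below is an illustration under stated assumptions: the function name `sort_by_weights` is hypothetical, the nondeterministic choice of a cooperating couple is imitated with a seeded random generator, and one couple cooperates per step, in which the process holding the smaller value decreases its weight, as in the function f_i of section 4.1.

```python
import random

def sort_by_weights(values, seed=0):
    """Simulate the cooperation scheme on distinct values; return final weights."""
    rng = random.Random(seed)
    n = len(values)
    weights = [n] * n                       # initial multiset {n, ..., n}
    while True:
        # neighbours: couples whose weights satisfy R (equal weights)
        couples = [(i, j) for i in range(n) for j in range(i + 1, n)
                   if weights[i] == weights[j]]
        if not couples:                     # steady state: all isolated
            return weights
        i, j = rng.choice(couples)          # nondeterministic selection
        if values[i] < values[j]:
            weights[i] -= 1
        else:
            weights[j] -= 1
```

Each step lowers exactly one weight, so the multiset of weights decreases and the loop terminates. For S = {7,4,2,3,1} the steady state gives the weights 5, 4, 2, 3, 1, whichever couples are selected along the way.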


4.3. Proof of termination 


W_1 = {n,...,n}; then, given any configuration
W_i derived from W_1 (by application of the functions
f_i), we have W_i << W_1, and for any W_k derived from
W_i, W_k << W_i. Configurations W_i have a lower bound,
W_lb = {1,2,3,4,5}. The termination proof is then
straightforward.


Remark. If the numbers are not assumed to be diffe-
rent, the condition R becomes w_i = w_j ∧ v_i ≠ v_j.
Thus the weight is a couple of integers (w,v). Ter-
mination proof is identical.


5. Review and discussion 


Appropriate language constructs have been de-
signed in order to describe processes and condi-
tions. This cooperation scheme has been applied to
the solution of a variety of problems : parallel
pretty printer, parallel compile-time symbol reso-
lution, implementation of invariant properties in
distributed systems (these properties are related
to logical time, weights w_i are timestamps asso-
ciated to each process) ...


MULTIPLE PIPELINE SCHEDULING IN VECTOR SUPERCOMPUTERS


Abstract -- A parallel task scheduling model
is proposed for multi-pipeline vector processors.
This model can be applied to explore maximal con-
currency in vector computers, like the CRAY-1,
CYBER-205, STAR-100, TI-ASC, and IBM 3838. The
optimization problem of simultaneously scheduling
multiple pipelines with vector tasks is shown to
be NP-complete. Thus, we have developed several
heuristic scheduling algorithms, which can be easily
implemented in vector processors with low system
overhead and high throughput performance.


Introduction 


High-performance vector computers are demand-
ed in numerical weather forecasting, structural
analysis, seismic data processing, simulation of
nuclear reactors, aerodynamics simulation, and
many other large-scale scientific/engineering
computing applications. In contrast to a scalar
processor that processes one data element at a time,
a vector computer has vector instructions applied
to groups of data elements, called vectors. Vector
instructions have inherent advantages over equiva-
lent scalar instructions embedded in DO-loops. A
vector instruction saves repeated instruction
fetches in a DO-loop and eliminates index and
branch instructions for loop control. Vector com-
puters appear as array processors or pipeline com-
puters [8]. The array approach uses replicated
processing elements (PEs) to exploit spatial para-
llelism, such as 64 PEs in Illiac IV and 16 PEs in
Burroughs Scientific Processor. Notable pipeline
computers include Texas Instruments ASC system,
CDC STAR-100, IBM 3838, Cray Research CRAY-1 [15],
and CDC CYBER-205 [3]. Pipelined computers have
been widely adopted in commercial computer systems.
Array processors appear only in a few research
computers [8, 10].

In this paper, we develop new methods to pro- 
mote parallel execution of vector instructions in 
a pipelined computer. Concurrencies in programs 
should be exploited by multiple pipelines in a 
vector processor. Each task system contains a set 
of vector instructions (tasks) with precedence 
relation determined only by data dependencies. 


Each pipeline processor is assumed to be multifunc-
tional, that is, capable of executing different
functions at different times, but only one static
function at a time.

For parallel vector processing, multiple
pipelines are used to reduce the execution time of
all instructions in a given task system. The
problem of scheduling multiple scalar tasks on
multiple pipelines has been studied by Ramamoorthy
and Li [14], and Brune, et al [2]. Their results
indicate that some optimal scheduling algorithms
can be obtained for only very restricted classes
of task systems. Li [11] studied the scheduling
problem for restricted vector loops. We are
interested in using several pipelines simultaneous-
ly processing a long vector task. A long vector
task is partitioned into many subvectors to be
processed by several pipelines simultaneously.
Lloyd [12] suggested the use of several processors
for a single task. The number of pipelines
required to process a vector task is determined
by the chosen scheduling algorithm.

In a multi-pipeline computer, significant
overhead time is required to execute a vector
instruction due to start-up and pipeline flushing
delays [10]. This overhead time may reduce the
performance of the pipeline system. Li neglected
overhead in order to simplify the scheduling model
for vector tasks [11,14]. Bruno and Downey [1]
discussed the complexity of task sequencing in-
cluding set-up time. We consider system overhead
for scheduling vector tasks in multiple pipelines.
We prove that the multi-pipeline scheduling pro-
blem is NP-complete, even for restricted task
classes. Heuristic scheduling algorithms are deve-
loped to enable parallel vector processing. Per-
formance bounds are derived for these heuristic
algorithms. Several example task systems are used
to illustrate the proposed vector scheduling me-
thodology.


The Vector Task Scheduling Model 


A functional block diagram of a typical mul-
tiple-pipeline vector computer is shown in Fig. 1.
Main memory is often interleaved to minimize the
access time of vector operands. Instructions and
data may appear in either vector or scalar forms.
The Instruction Processing Unit (IPU) fetches and
decodes both scalar and vector instructions. All
scalar instructions are dispatched to the Scalar
Processor for execution. The Scalar Processor
contains multiple scalar pipelines. After a vec-
tor instruction is recognized by the IPU, the
Vector Controller takes over in supervising its
execution. The functions of this controller in-
clude decoding vector instructions, calculating
effective vector-operand addresses, setting up the
Vector Access Controller and the Vector Pipelines,
and monitoring the execution of vector instruc-
tions. The Vector Access Controller is responsible
for fetching vector operands by a series of main
memory accesses. The Vector Buffer acts as a
cache to close up the speed gap between the Vector
Access Controller and Vector Pipelines. We assume
m identical Vector Pipelines, each of which is
static and multifunctional.

We concentrate our study on vector tasks
exclusively. The Vector Controller is capable of
scheduling several vector instructions simultane-
ously. The time required to complete the execu-
tion of a single vector instruction (vector task)
is measured by (Kogge [10])

    t = t_o + τ = t_o + t_l · L                    (1)

where t_o is the overhead time due to start-up and
flushing delays, t_l is the average latency between
two successive operand pairs, and L is the vector
length (the number of component operands in a vec-
tor). The start-up time is measured from the ini-
tiation of the vector instruction to the entrance
of the first operand pair into the pipeline. The
flush time is measured from the entrance of the
last operand pair to the completion of that vector
instruction. The average latency is measured be-
tween two successive operand pairs entering the
pipeline. The parameter τ = t_l · L is called the
productive time. Parameters t_o and t_l vary with
different vector instructions. The overhead time
t_o may require several hundreds of pipeline cycles.
The average latency t_l is usually one, two or a
few pipeline cycles.
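A small numerical sketch shows why the overhead matters when a vector is split across pipelines. It assumes, in line with the definitions of the overhead time, the average latency and the vector length above, that the total time of one vector instruction is the overhead plus the productive time; the function names and the even split are hypothetical choices made for the example.

```python
def vector_task_time(t_o, t_l, length):
    """Overhead plus productive time (latency times vector length)."""
    return t_o + t_l * length

def partitioned_time(t_o, t_l, length, pieces):
    """Completion time when the vector is split evenly over `pieces` pipelines.

    Every subvector pays the full overhead t_o on its own pipeline.
    """
    longest = -(-length // pieces)   # ceil(length / pieces)
    return vector_task_time(t_o, t_l, longest)
```

With an overhead of 100 cycles and a latency of 1 cycle, a 64-element vector takes 164 cycles on one pipeline and still 132 cycles on two, so splitting short vectors gains little; this is the effect the scheduling model has to account for.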

Given a task system, we wish to schedule the
vector tasks among m identical pipelines such that
the total execution time is minimized. For sim-
plicity, we assume equal overhead time t_o for all
vector tasks. A vector task system can be charac-
terized by a four-tuple [Π, <, t_o, τ], where

(1). Π = {T_1, T_2, ..., T_n} is a set of n vector
     tasks.

(2). < is a partial ordering relation, speci-
     fying the precedence relationships among
     the tasks in set Π.

(3). t_o is the overhead time of each vector
     task.

(4). τ: Π → R+ is a time function defining
     the productive time τ(T_i) of each task T_i.

A parallel schedule for a vector task system
[Π, <, t_o, τ] is a total function f, mapping each
task T ∈ Π into a finite subset of interval-pipeline
pairs, where an interval-pipeline pair ([x,y],P_i)
represents the event that a subtask of T is being
processed by pipeline P_i during time interval
[x,y]. If f(T) = {([x_1,y_1],P_{i_1}), ...,
([x_k,y_k],P_{i_k})}, then the following conditions
must be met in order to smooth the pipeline opera-
tions.

(1). y_j - x_j > t_o and Σ_{j=1}^{k} (y_j - x_j - t_o)
     = τ(T), for all x_j, y_j ∈ R+, j = 1,2,...,k.

(2). P_{i_j} ∈ {P_1,...,P_m} for j = 1,2,...,k.
     If P_{i_j} = P_{i_h} with j ≠ h, then
     [x_j,y_j] ∩ [x_h,y_h] = ∅.

(3). At time t, vector task T is executed by
     a subset of the k pipelines
     {P_{i_j}: x_j ≤ t ≤ y_j, j = 1,2,...,k}.

(4). The start time is S(T) = Min{x_1,x_2,...,x_k}
     and the finish time is F(T) = Max{y_1,y_2,
     ...,y_k}.
Moreover, a parallel schedule for a given task
system must satisfy the following two properties:

(a). Different vector tasks cannot be pro-
     cessed by the same pipeline at the same
     time because of using only static pipes.

(b). Whenever T_i < T_j then S(T_j) ≥ F(T_i) as
     governed by the precedence relation
     among tasks.

The finish time w of a parallel schedule for
n tasks is defined by w = Max{F(T_1),F(T_2),...,
F(T_n)}. An optimal parallel schedule has the mi-
nimal finish time w_o among all parallel schedules
for the given task system. Our objective is to
find an "optimal" or "suboptimal" parallel sche-
dule for any given vector task system. The follow-
ing example will clarify the problem environment.


Example 1.

Given a vector task system [Π, <, t_o, τ] as
specified in Fig.2(a), where Π = {T_1,T_2,T_3,T_4},
t_o = 1, τ(T_1) = 10, τ(T_2) = 2, τ(T_3) = 6, and
τ(T_4) = 2. We want to schedule the four
tasks on two (m=2) pipelines. Using the shorthand
notation τ_i = τ(T_i) for 1 ≤ i ≤ 4, a parallel
schedule f is obtained in Fig.2(b), where the
shaded area shows the idle period of the pipelines.
The vector task T_1 is partitioned into two sub-
tasks, T_11 and T_12, with τ_11 = 7 and τ_12 = 3. Si-
milarly, the vector task T_3 is partitioned into
two subtasks, T_31 and T_32, with τ_31 = 4, τ_32 = 2.
The parallel schedule f is specified by the follow-
ing mappings.

f(T_1) = {([0,8],P_1),([3,7],P_2)}, with S(T_1) = 0
         and F(T_1) = 8,

f(T_2) = {([8,11],P_2)}, with S(T_2) = 8 and
         F(T_2) = 11,

f(T_3) = {([8,13],P_1),([11,14],P_2)}, with S(T_3)
         = 8 and F(T_3) = 14,

f(T_4) = {([0,3],P_2)}, with S(T_4) = 0 and F(T_4)
         = 3.

The finish time of the parallel schedule f
is thus w = F(T_3) = 14 as revealed in Fig.2.
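The numbers of Example 1 can be checked mechanically. The sketch below is an illustration only: the function name `finish_time`, the dictionary layout, and the pipeline labels "P1" and "P2" are assumptions (the assignment of intervals to the two identical pipelines is one labelling consistent with the intervals), the overhead is taken as one time unit as the interval lengths imply, and the precedence relation of Fig. 2(a) is not reproduced.

```python
def finish_time(schedule, tau, t_o, precedence=()):
    """Validate a parallel schedule and return its finish time w.

    schedule: task -> list of ((x, y), pipeline) pairs
    tau:      task -> productive time
    """
    for task, pieces in schedule.items():
        # every subtask must be longer than the overhead it pays
        if any(y - x <= t_o for (x, y), _ in pieces):
            raise ValueError(f"{task}: interval too short")
        # productive parts of the subtasks must add up to tau(T)
        if sum(y - x - t_o for (x, y), _ in pieces) != tau[task]:
            raise ValueError(f"{task}: productive time mismatch")
    # a static pipeline processes one subtask at a time
    busy = sorted((pipe, x, y) for pieces in schedule.values()
                  for (x, y), pipe in pieces)
    for (p1, _, y1), (p2, x2, _) in zip(busy, busy[1:]):
        if p1 == p2 and y1 > x2:
            raise ValueError(f"overlap on {p1}")
    start = {t: min(x for (x, _), _ in ps) for t, ps in schedule.items()}
    finish = {t: max(y for (_, y), _ in ps) for t, ps in schedule.items()}
    for before, after in precedence:
        if start[after] < finish[before]:
            raise ValueError(f"{after} starts before {before} finishes")
    return max(finish.values())

example_1 = {
    "T1": [((0, 8), "P1"), ((3, 7), "P2")],
    "T2": [((8, 11), "P2")],
    "T3": [((8, 13), "P1"), ((11, 14), "P2")],
    "T4": [((0, 3), "P2")],
}
example_tau = {"T1": 10, "T2": 2, "T3": 6, "T4": 2}
```

Calling `finish_time(example_1, example_tau, 1)` accepts the schedule and yields the finish time 14 stated in the example.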


The NP Completeness of Pipeline Scheduling Problem 


NP-complete problems have received much
attention in recent years [6]. Ullman [17] proved
that the general preemptive scheduling problem is
NP-complete. Ramamoorthy and Li [14] studied the
scheduling problem for shared-resource pipeline
systems. They considered only scheduling scalar
task systems. We consider here the scheduling of
vector tasks. The multiple-pipeline scheduling
problem can be stated as a feasibility problem:
Given a vector task system [Π, <, t_o, τ], a vector
computer with m identical pipelines, and a dead-
line D, does there exist a parallel schedule f
with finish time w such that w ≤ D?

We shall consider two partial ordering rela-
tions over a given task set Π. An empty relation,
ε, corresponds to the set of all independent
tasks. A tree relation, θ, is a precedence rela-
tion in which all tasks are related by a single-
rooted tree. Proofs of all theorems can be found
in reference [16].


Theorem 1: 


The feasibility problem for scheduling a task 
system of independent vector instructions over 
multiple pipelines is NP-complete. 


Theorem 2: 


The feasibility problem for scheduling a tree 
system of vector instructions over multiple pipe- 
lines is NP-complete. 

If all vector tasks in an independent task
system have equal productive time, i.e. τ_i is a
constant for i = 1,2,...,n, then it is possible to
solve the feasibility problem in polynomial time.
This suggests that, with additional restrictions
on the scheduling problem, one may expect a poly-
nomial-time scheduling algorithm. In this paper,
we schedule vector tasks with different productive
times. The NP-completeness of the above two feasi-
bility problems indicates that the multiple-pipe-
line scheduling problem is indeed very hard to
solve. Due to this computational intractability,
heuristic scheduling algorithms are desired in
real-life system designs, even though heuristic
schedules may not be necessarily optimal.


Scheduling Independent Vector Tasks 


The scheduling algorithm for independent tasks takes
as input a task system $[\Pi, \varepsilon, t_o, \tau]$ for
m identical vector pipelines, where
$\Pi = \{T_1, \dots, T_n\}$ and $\tau(T_i) = \tau_i$ for
i = 1,...,n. The output is a parallel schedule, f, for
the given task system of independent tasks.
Let $t_j$ be the time span of using pipeline $P_j$ in the
execution of a given task system. The overhead time and
productive time are both included in $t_j$. Let k be the
total number of partitions for all vector tasks in a
parallel schedule. If no vector task has ever been
partitioned, the average time span is computed by

    $t_a = (\sum_{i=1}^{n} \tau_i + n t_o)/m$.

If a parallel schedule has k partitions, then the average
time span becomes

    $t_a^k = (\sum_{i=1}^{n} \tau_i + (n + k) t_o)/m$.

The criterion for developing the heuristic algorithm is to
make $t_j$, j = 1,2,...,m, as close to the average value
$t_a$ or $t_a^k$ as possible. We assume the condition that
$t_a \ge t_o/2$ for any practical task system.


ALGORITHM A (For scheduling independent tasks):


Step 1. /Initialize parameters/

        $t_j \leftarrow 0$ for j = 1,...,m; $i \leftarrow 1$; $j \leftarrow 1$;
        $f(T_i) \leftarrow \emptyset$ for i = 1,...,n;

Step 2. /Assign task $T_i$ to pipeline $P_j$ and then
        check if the scheduling process is complete/

        $t' \leftarrow t_j$; $t_j \leftarrow t' + t_o + \tau_i$;
        If i = n and j = m, then assign the task by
        $f(T_i) \leftarrow f(T_i) \cup \{([t', t_j], P_j)\}$ and
        terminate the process.

Step 3. /Check if the time span of pipeline $P_j$ is
        within a given bound/

        If $|t_j - t_a| \le t_o/2$ then assign task $T_i$ to
        pipeline $P_j$ with
        $f(T_i) \leftarrow f(T_i) \cup \{([t', t_j], P_j)\}$;
        increment the indices $j \leftarrow j + 1$ and
        $i \leftarrow i + 1$; and go to Step 2.

Step 4. /Compare the time span of pipeline $P_j$ with
        the allowable bound/

        If $t_j > t_a$, then go to Step 5, else assign the
        task $T_i$ to pipeline $P_j$ with
        $f(T_i) \leftarrow f(T_i) \cup \{([t', t_j], P_j)\}$;
        increment $i \leftarrow i + 1$; and go to Step 2.

Step 5. /One subtask of $T_i$ is being processed by
        pipeline $P_j$. Update the average time span.
        Assign a subtask of $T_i$ to pipeline $P_{j+1}$/

        Set $f(T_i) \leftarrow f(T_i) \cup \{([t', t_a + t_o/2], P_j)\}$;
        $\tau_i \leftarrow t_j - (t_a + t_o/2)$;
        $t_j \leftarrow t_a + t_o/2$;
        $t_a \leftarrow t_a + t_o/m$; $j \leftarrow j + 1$;
        and go to Step 2.


The parallel schedule generated from Algorithm A is
denoted as $f_A$. Algorithm A adopts a bin packing approach
[4] by assigning all possible tasks to pipeline $P_1$
before considering pipeline $P_2$ and so on. This approach
has been successfully used by McNaughton [13] to construct
the shortest preemptive schedule for independent tasks on
$m \ge 2$ identical processors. The complexity of
Algorithm A is O(n). We have used the time span criterion
$|t_j - t_a| \le t_o/2$ to decide when to partition a vector
task and update the average time span $t_a$. The maximum
number of partitions for a given task system is m - 1.
The finish time $\omega_0$ of an optimal schedule $f_0$
for a task system of independent tasks is lower bounded by
the average time span $t_a$. This lower bound occurs when
no task is being partitioned.

    $\omega_0 \ge t_a$                                   (2)


Theorem 3:
Applying Algorithm A to the independent task system
$[\Pi, \varepsilon, t_o, \tau]$ over m pipelines, we obtain
the following upper bound on the finish time $\omega_A$ of
the schedule $f_A$:

    $\omega_A \le t_a + (m - 1) t_o/2$                   (3)
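
As a quick numerical check of bounds (2) and (3), the following Python sketch evaluates the average time span and the Theorem 3 bound. It is only an illustration: the function names are hypothetical, the five productive times are the ones listed in Example 2, and the overhead $t_o = 1$ is an assumption inferred from the time spans reported in that example. The k-partition formula assumes that every partition adds exactly one extra overhead $t_o$.

```python
def average_time_span(tau, m, t_o, k=0):
    """Average time span over m pipelines.

    tau : productive times of the n tasks
    t_o : overhead paid once per task or subtask (assumed constant)
    k   : number of partitions; each one is assumed to add one overhead
    """
    n = len(tau)
    return (sum(tau) + (n + k) * t_o) / m


def upper_bound_algorithm_a(tau, m, t_o):
    """Right-hand side of bound (3): t_a + (m - 1) * t_o / 2."""
    return average_time_span(tau, m, t_o) + (m - 1) * t_o / 2


tau = [13, 8, 7, 11, 3]   # productive times from Example 2
m, t_o = 4, 1             # t_o = 1 is an inferred assumption

lower = average_time_span(tau, m, t_o)        # lower bound (2) on the optimum
upper = upper_bound_algorithm_a(tau, m, t_o)  # upper bound (3) for Algorithm A
print(lower, upper)
```

With these inputs the lower bound is 11.75 and the upper bound is 13.25, so the finish time of 12.5 reported in Example 2 falls between them.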


We have performed a series of simulation experiments
in order to compare our results with three known
scheduling algorithms: First Come First Serve (FCFS),
Randomly Choose (RC), and Longest Process First (LPF). We
consider m = 4 pipelines with overhead time $t_o = 1$. The
productive time of each task is a random variable,
uniformly distributed in the range [1,999]. We examined
100 task systems each with n independent tasks for
$4 \le n \le 20$. Schedules for each heuristic algorithm
and their average finish times are generated. Let $W_i$ be
the average finish time of a schedule and $W_0$ be the
lower bound of the average finish time for an optimal
schedule. The performance ratio ($W_i/W_0$) for each
scheduling algorithm is plotted in Fig. 5. Algorithm A is
shown to be superior to all three known scheduling
heuristics.


Example 2:


Given a task system $[\Pi, \varepsilon, t_o, \tau]$ of
independent vector tasks, where
$\Pi = \{T_1, T_2, T_3, T_4, T_5\}$ and $\tau(T_1) = 13$,
$\tau(T_2) = 8$, $\tau(T_3) = 7$, $\tau(T_4) = 11$,
$\tau(T_5) = 3$.

A parallel schedule $f_A$ for this task system is shown in
Fig. 6(a). Task $T_1$ is partitioned into two subtasks with
$\tau_{11} = 11.25$ and $\tau_{12} = 1.75$. Similar
partitioning is done for $T_4$ with $\tau_{41} = 3.5$ and
$\tau_{42} = 7.5$. Vector tasks $T_2$, $T_3$ and $T_5$ are
scheduled without partitioning. Such a schedule $f_A$ is
defined by

    $f_A(T_1) = \{([0,12.25],P_1), ([0,2.75],P_2)\}$
    $f_A(T_2) = \{([2.75,11.75],P_2)\}$
    $f_A(T_3) = \{([0,8],P_3)\}$
    $f_A(T_4) = \{([8,12.5],P_3), ([0,8.5],P_4)\}$
    $f_A(T_5) = \{([8.5,12.5],P_4)\}$

The time spans of the four pipelines are $t_1 = 12.25$,
$t_2 = 11.75$, $t_3 = 12.5$ and $t_4 = 12.5$. The finish time
of $f_A$ is $\omega_A = 12.5$, which is slightly higher than
that of the optimal schedule with $\omega_0 = 12$ as shown
in Fig. 6(b).
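
The schedule of Example 2 can be replayed with a short Python sketch of Algorithm A. This is one reading of the algorithm, not the authors' code: the function name is hypothetical, pipelines are numbered from 0, the overhead $t_o = 1$ is inferred from the reported time spans, and the rule that the last pipeline absorbs all remaining work without further partitioning is an added assumption to keep the loop well defined.

```python
def algorithm_a(tau, m, t_o):
    """Greedy pipeline filling with task partitioning (sketch of Algorithm A).

    Returns (pieces, spans): pieces[i] lists (start, end, pipeline) intervals
    for task i, and spans[j] is the time span of pipeline j (0-based).
    """
    n = len(tau)
    remaining = list(tau)                  # productive time still unscheduled
    t_a = (sum(tau) + n * t_o) / m         # average time span, no partitions
    pieces = [[] for _ in range(n)]
    spans = [0.0] * m
    i = j = 0
    while i < n:
        start = spans[j]
        end = start + t_o + remaining[i]   # tentative span if task i goes here
        if j == m - 1 or end <= t_a + t_o / 2:
            # Steps 3 and 4: the (rest of the) task fits on pipeline j.
            pieces[i].append((start, end, j))
            spans[j] = end
            if j < m - 1 and end >= t_a - t_o / 2:
                j += 1                     # pipeline j is close enough to t_a
            i += 1
        else:
            # Step 5: cut the task at t_a + t_o/2 and move the rest onward.
            cut = t_a + t_o / 2
            pieces[i].append((start, cut, j))
            spans[j] = cut
            remaining[i] = end - cut       # productive time left over
            t_a += t_o / m                 # one more overhead to distribute
            j += 1
    return pieces, spans


pieces, spans = algorithm_a([13, 8, 7, 11, 3], m=4, t_o=1)
print(spans)        # time spans of the four pipelines
print(max(spans))   # finish time of the schedule
```

Run on the five tasks of Example 2, the sketch yields the time spans 12.25, 11.75, 12.5 and 12.5 and splits the first and fourth tasks, in agreement with the mappings listed in the example.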


Scheduling A Tree Task System 
The heuristic we developed for a tree task system
$[\Pi, \theta, t_o, \tau]$ is based on a tagged scheduling
policy. First, we mark each vector task by a tag
$\lambda$. If a vector task $T_i$ has no immediate
predecessors, then set tag $\lambda(T_i) \leftarrow 1$. If
$T_i$ has not been assigned a tag value and all the
immediate predecessors of $T_i$, namely
$T_{i1}, T_{i2}, \dots, T_{ik}$, have tag values, then assign
$\lambda(T_i) \leftarrow \max\{\lambda(T_{i1}), \dots, \lambda(T_{ik})\} + 1$.
After tagging, we form a group of subsets
$E_1, E_2, \dots, E_\ell$, where
$E_i = \{T_j \mid \lambda(T_j) = i, T_j \in \Pi\}$, and $\ell$
is the largest tag value (tree height). Each $E_i$ consists
of independent tasks which can be processed concurrently.
Obviously, $\bigcup_{i=1}^{\ell} E_i = \Pi$, and
$E_i \cap E_j = \emptyset$ if $i \ne j$. Once we obtain
$E_1, E_2, \dots, E_\ell$, we can apply Algorithm A, for
each $E_i$, to obtain a parallel schedule for the tree
system of n tasks.
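
The tagging rule can be sketched in a few lines of Python. The function name and the predecessor table below are hypothetical: the tree of Fig. 7(a) is not reproduced in the text, so the table is merely one tree whose grouping matches the level sets reported in Example 3.

```python
def tag_levels(preds):
    """Tag every task and group tasks by tag value.

    preds maps each task name to the list of its immediate predecessors.
    Returns (tags, levels) where levels[i - 1] is the set E_i as a list.
    """
    tags = {}

    def tag(task):
        # A task with no predecessors gets tag 1; otherwise one more than
        # the largest tag among its immediate predecessors.
        if task not in tags:
            ps = preds[task]
            tags[task] = 1 if not ps else 1 + max(tag(p) for p in ps)
        return tags[task]

    for task in preds:
        tag(task)
    height = max(tags.values())
    levels = [[t for t in preds if tags[t] == i] for i in range(1, height + 1)]
    return tags, levels


# Hypothetical precedence table (an assumption, not taken from Fig. 7(a)).
preds = {
    "T1": [], "T2": [], "T3": [], "T4": [],
    "T5": ["T1", "T2"], "T6": ["T3"], "T7": ["T4"],
    "T8": ["T5", "T6", "T7"],
    "T9": ["T8"],
}
tags, levels = tag_levels(preds)
print(levels)
```

Each inner list can then be handed to an independent-task scheduler, with each level starting when the previous one finishes.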
Algorithm B (for scheduling tree tasks):


Step 1. Generate $E_1, E_2, \dots, E_\ell$.

Step 2. For i = 1 to $\ell$ step 1 do
        begin
          If $|E_i| \ge 2$, call Algorithm A with the
          independent task system
          $[E_i, \varepsilon, t_o, \tau]$ as input. If
          $|E_i| = 1$, perform equal partitioning.
          /The start time of the schedule for
          $[E_i, \varepsilon, t_o, \tau]$ equals the finish
          time of the schedule for
          $[E_{i-1}, \varepsilon, t_o, \tau]$/.
        end
In Algorithm B, the time needed in Step 1 is of order
O(n) as proved in [4]. For each independent task system
$[E_i, \varepsilon, t_o, \tau]$, the run time using
Algorithm A has order $O(k_i)$, where $k_i$ is the number
of tasks in $E_i$. At most m-1 partitions could occur in
the schedule $f_A$. Thus, the complexity of Algorithm B
has order O(n) and at most $\ell(m-1)$ partitions can be
made in the schedule $f_B$.


Theorem 4: 


Applying Algorithm B to a tree task system
$[\Pi, \theta, t_o, \tau]$ over m pipelines, we obtain the
following upper bound on the finish time $\omega_B$, where
$\omega_0$ is the finish time of an optimal schedule for
the same tree task system and $\ell$ is the tree height.


<s 
Op 


1 + ALD 2. (4) 


Example 3:
Given a tree task system $[\Pi, \theta, t_o, \tau]$ where
$\Pi = \{T_1, \dots, T_9\}$ follows the tree relationship
shown in Fig. 7(a). Suppose $t_o = 1$, $\tau_1 = 2$,
$\tau_2 = 4$, $\tau_3 = 6$, $\tau_4 = 8$, $\tau_5 = 8$,
$\tau_6 = 2$, $\tau_7 = 4$, $\tau_8 = 6$, $\tau_9 = 4$. We
want to schedule this tree task system on m = 4 identical
pipelines. Using Algorithm B, we obtain
$E_1 = \{T_1, T_2, T_3, T_4\}$, $E_2 = \{T_5, T_6, T_7\}$,
$E_3 = \{T_8\}$, $E_4 = \{T_9\}$ at Step 1.

At Step 2, a parallel schedule $f_B$ is generated as
depicted in Fig. 7(b). Shaded areas indicate the idle times
of pipelines. Tasks $T_2$, $T_3$, $T_4$, $T_5$, $T_8$ and
$T_9$ have been partitioned into subtasks. The schedule
$f_B$ is specified by the following mappings:

    $f_B(T_1) = \{([0,3],P_1)\}$
    $f_B(T_2) = \{([3,6.5],P_1), ([0,2.5],P_2)\}$
    $f_B(T_3) = \{([2.5,6.75],P_2), ([0,3.75],P_3)\}$
    $f_B(T_4) = \{([3.75,7],P_3), ([0,6.75],P_4)\}$
    $f_B(T_5) = \{([7,11.75],P_1), ([7,12],P_2), ([7,8.25],P_3)\}$
    $f_B(T_6) = \{([8.25,11.25],P_3)\}$
    $f_B(T_7) = \{([7,12],P_4)\}$
    $f_B(T_8) = \{([12,14.5],P_i) : 1 \le i \le 4\}$
    $f_B(T_9) = \{([14.5,16.5],P_i) : 1 \le i \le 4\}$

The finish time of $f_B$ is $\omega_B = 16.5$, within the
same order of magnitude as the finish time $\omega_0$ of an
optimal schedule, which has a lower bound
$\omega_0 \ge 13.25$.
Conclusions 


Scheduling vector tasks in a multi-pipeline vector
processor is done in parallel in the proposed scheduling
algorithms. Concurrent processing allows a vector to be
partitioned into several subvectors for simultaneous
execution by parallel pipelines. We have considered the
overhead time associated with the pipelined execution of
vector instructions. The parallel pipeline scheduling
problem is shown to be NP-complete, which precludes us
from insisting on optimal scheduling algorithms. Heuristic
algorithms are thus developed for independent and tree
task systems. If the average time span without
partitioning is longer than the overhead time, high
performance is expected from these heuristic algorithms.
Our study can be extended to schedule vector task systems
other than independent or tree tasks. The partitioning of
a vector by time units can also be converted to
partitioning by vector lengths. The proposed pipeline
scheduling methodology should be very useful to those who
are involved in the design and evaluation of
supercomputers for parallel vector processing.


Fig. 1 The functional block diagram of a multiple-pipeline
vector computer.

(a) The precedence graph of a vector task system.
(b) A parallel schedule $f_1$ for the task system in (a).

Fig. 2 The parallel scheduling of a task system of vector
instructions.

Fig. 3 A worst-case example showing the schedule $f_A$
without partitioning in the proof of case 1 in Theorem 3.

Fig. 4 A worst-case example showing the schedule $f_A$ with
k partitions in the proof of case 2 in Theorem 3.

Fig. 5 Performance comparison of Algorithm A and three
known scheduling algorithms.

(a) A parallel schedule $f_A$ for the task system in
Example 2.
(b) A known optimal schedule for the task system in
Example 2.

Fig. 6 Two parallel schedules for the task system in
Example 2.

(a) A tree task system.
(b) A parallel schedule $f_B$ obtained from Algorithm B.

Fig. 7 A tree task system and a parallel schedule


OF THREE AUTOMATIC VECTORIZER PACKAGES

Clifford N. Arnold 
Research and Advanced Design Laboratory 
Control Data Corporation 
St. Paul, Minnesota 55112 


Abstract - Eighteen kernels were used as a 
benchmark for three automatic  vectorizer 
packages. The resultant code was timed on 


Control Data CYBER 203 and CYBER 205 systems 
to assess the performance improvement produced 
by such automated techniques. Of the 18 
kernels, 16 were significantly transformed by 
at least one software package; thirteen ran 
faster on the CYBER 205 than highly optimized 
scalar code, and of those, eleven ran faster 
by at least a factor of three. A rough 
estimate of the programming effort to do the 
same vector transformation by hand showed that 
the automated software can speed the 
translation process by a factor of ten. 


1. Introduction 

The potential of very fast computational rates 
for vector and array processors necessitates 
code rewriting from scalar constructs to array 
or vector constructs. Array data structures 
can then be manipulated in parallel in 
multiple arithmetic units or streamed through 
pipeline units (which essentially eliminates 
the load/store time). The Control Data CYBER 
205 and the Cray Research Cray-1S computers 
possess vector rates which range from two to 
ten times their respective scalar speeds, and 
typically greater than five times the speed of 
a CDC 7600. Experience tells us that for a 
given code these high rates are approached 
only if the large majority of the work is done 
in vector mode. Otherwise the scalar time and 
thus the scalar rate dominates. Software 
tools that simplify the process of generating 
good vector code would help users make better 
utilization of these machines. 


Several compiler and preprocessor projects are 
in progress across the country to develop and 
perfect techniques to map scalar code into 
vector code. This function is called 
automatic vectorization. It should not be 
confused with scalar optimization which is a 
highly developed and mature technique. 


Several vectorizing compilers exist in the 
field, but they are generally considered early 
versions of an underdeveloped artform. Good 
examples are the compilers for the Texas 
Instruments ASC, the Cray-1, and the CDC CYBER 
200 Series. Research on automatic 
vectorization has gone far beyond these. For 
a review of such research see [1] and [2] and 
references therein. 


Two software packages that aid the programmer 
and compiler in generating vector code have 
come to my attention. One is from Kuck and 
Associates, Inc. and is called the Kuck 
Analyzer Package (KAP) [3]. The other is from 
Pacific Sierra Research [4][5] and is called 
the Vector and Array Syntax Translator 
(VAST). This paper reports on an experiment 
to assess potential performance improvements 
due to each of three automatic vectorizers: 
KAP, VAST, and a CDC CYBER 200 FORTRAN 
Compiler (hereafter referred to as CYBER 200 
FTN). Comments on usability and potential for 
productivity improvements are included. 


2. Procedure 

The test base consisted of 18 kernels of 
FORTRAN from the Lawrence Livermore Laboratory 
[6][7]. This benchmark, commonly called the 
Livermore Loops, was run through KAP and 
VAST. Both analyzers output FORTRAN source 
code with vector extensions. KAP output is 
not CYBER 200 compatible, so the syntax of the 
KAP output was manually converted to CYBER 200 
FORTRAN. Care was taken not to add vector 
constructs unless KAP indicated the need and 
vector constructs were always added where 
indicated by KAP. The conversion was intended 
to be a simple translation by a process that 
could easily be automated. The only semantic 
change was to declare three arrays ROWWISE. 


This change was indicated by KAP (though a
straightforward syntax change could also vectorize this
construct). The conversion process was very time
consuming and prone to errors. Most of the time was spent
tracking down hard-to-discover typographical errors. I
also needed to keep track of array declarations for new
scratch vectors and had to spend time learning the CYBER
200 vector extensions that an automatic source code
generator would have available to it. The process in all
required many runs and six to eight person weeks.


VAST, on the other hand, produces CYBER 200 source code
directly. The input source code was run through VAST and
then through compilation and execution. A few simple user
directives were added to the source code; these appear as
comments to the compiler, but are recognized by the VAST
product as permission to collapse loops if possible. Four
runs, over a period of three days, were required to obtain
the best results.


The original code was adjusted to execute three times,
varying the loop trip counts each time. For vector
kernels, these counts turn into vector lengths, showing
the characteristics of vector performance for those loops.
The ranges covered in the counts were designed to be broad
and also to bracket those used in the Livermore Magnetic
Fusion Energy Procurement (1978).


The actual experiment was based on repetitions of seven
different runs: four on the CYBER 203 and three on the
CYBER 205. The original source code was compiled by CYBER
200 FTN with only scalar optimization and with both scalar
optimization and vectorization. This source code was also
processed through VAST followed by CYBER 200 FTN. The
converted KAP output was compiled by CYBER 200 FTN.


3. Results

VAST and CYBER 200 FTN take a comparable amount of time to
execute the vectorization analysis and code translation,
and take considerably less time than KAP. Since the
packages are written in different languages, and KAP ran
on a different computer, absolute comparison is difficult.
Estimates show KAP would take about one decimal order of
magnitude longer to do its analysis than VAST or CYBER 200
FTN would if all were run on the same computer.


TABLE 1. TIMINGS FOR THE LIVERMORE LOOPS

CYBER 203 TIMINGS

                          CYBER
Kernel   Unvectorized    200 FTN       VAST          KAP
   1          9.6          22.4        23.0          22.7
   2         12.3          12.3         3.9          15.6
   3          5.9          15.7        15.7          15.8
   4          3.3           3.3         3.3           2.8
   5          7.9           7.9         7.9           3.5
   6          5.2           5.2         5.2           4.2
   7         17.0          24.4        24.5          24.4
   8         22.4          22.4        22.4          11.8
   9         13.0          18.1        17.6          18.1
  10          8.6          11.9        12.0          10.2
  11          1.7           7.5         8.5           7.4
  12          2.9          36.8     [illegible]      35.6
  13          3.1           3.1         3.0           2.6
  14          5.5           5.6         5.5           4.8
  15          3.4           3.3     [illegible]       3.7
  16          0.6           0.6         0.6           0.6
  17          4.9           4.9         4.8           4.9
  18          8.3           1.0        14.9          17.2



Computational performance of the kernels for the seven
execution scenarios is summarized in Table 1. Table
entries are in MFLOPS from the second pass of each run
with trip counts as noted in Figure 4. Errors in the
timings of the kernels were determined from repetitions of
the runs. Typical uncertainties for a given kernel were 4
and 0.5 microseconds for the CYBER 203 and CYBER 205
respectively. The Expected Errors (MFLOPS) in Table 1 are
based on the speed of the fastest execution scenario for
each kernel.


When automatic vectorization is applied, all 
kernels except 16 and 17 show significant 
performance changes relative to the 
unvectorized run. Kernels 5, 6, 13, and 14 


show the smallest variation. All execution 
scenarios for all variations of trip counts 
for these four kernels register performance 
within 56 percent of the unvectorized run. Of 
the remaining 12 kernels, only one, kernel 8, 
did not improve over the unvectorized 
performance for any scenario at some trip 
count. For the CYBER 205 runs, a majority of 
the kernels show impressive gains due to 
automatic vectorization. Kernels 2, 4, 5, 6, 
8, 11, 13, 14, 15, and 18 discriminate these 
vectorizers' capability to uncover’ vector 
constructs. 


TABLE 1 (continued). TIMINGS FOR THE LIVERMORE LOOPS

CYBER 205 TIMINGS

           CYBER                                 Expected
Kernel    200 FTN       VAST        KAP           Error
   1       121.5       121.3       120.2           1.5
   2        12.5        16.2        76.9           1.0
   3        78.8        73.9        76.6           2.9
   4         3.3         3.3        17.0           0.1
   5         7.9         7.9         5.7           0.1
   6         5.2         5.2         6.1           0.1
   7       146.0       146.9       144.7           2.1
   8        22.4        22.4        15.9          <0.05
   9        80.7        81.0        81.6           0.6
  10        29.6        29.8        30.7           0.3
  11         8.4         8.5         8.2           0.4
  12        73.0        77.4        76.0           7.9
  13         3.1         3.0         4.4          <0.05
  14         6.9         6.9         5.3           0.1
  15         3.4         3.4        19.3          <0.05
  16         0.6         0.6         0.6          <0.05
  17         4.9         4.9         4.9           0.2
  18         4.2        42.4        50.1           0.3

Notes:
1) Entries are in million floating point operations per
   second (MFLOPS).
2) Results are from the second pass with trip counts noted
   in Figure 4.


Below I summarize for each kernel the automatic
vectorization that has occurred. Note also the difference
between the CYBER 203 and CYBER 205 performance. Since
these two machines have identical scalar units, the timing
differences are due to the vector calculations, and
therefore help to point out how much of each kernel's
performance is due to vector manipulation.


Kernel 1: 1-D Hydrodynamics Excerpt
     This is a simple loop which all three vectorizers
     succeeded in vectorizing. Two scalar broadcasts make
     this loop run impressively on the CYBER 205.


Kernel 2: Unrolled Inner Product
     This DOT PRODUCT is camouflaged by being unrolled
     into a loop summing five partial products. CYBER 200
     FTN generates scalar code. VAST generates five vector
     temporaries using gather instructions and then does
     five vector additions and multiplications and lastly
     the vector sum. KAP recognizes the DOT PRODUCT and
     generates the single CYBER 200 macro instruction.
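
The camouflage in Kernel 2 can be illustrated with a small Python sketch. It is not the Livermore source: the function names and the data are hypothetical, and the vector length is assumed to be a multiple of five. The first form is the unrolled loop a vectorizer sees; the second is the single inner product it has to recognize.

```python
def unrolled_inner_product(z, x):
    """Inner product written as a loop over five partial products."""
    q = 0.0
    for k in range(0, len(z), 5):          # stride 5: this hides the dot product
        q += (z[k] * x[k] + z[k + 1] * x[k + 1] + z[k + 2] * x[k + 2]
              + z[k + 3] * x[k + 3] + z[k + 4] * x[k + 4])
    return q


def inner_product(z, x):
    """The same computation as one vector reduction."""
    return sum(a * b for a, b in zip(z, x))


z = [float(k) for k in range(1, 11)]       # 10 elements, a multiple of five
x = [2.0] * 10
print(unrolled_inner_product(z, x), inner_product(z, x))
```

The sample data are small integers stored as floats, so both forms give exactly the same sum; with general floating point data the two orders of summation may differ in the last bits.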
Kernel 3: Inner Product 


All three vectorizers recognize this as a 
vector DOT PRODUCT and generate the CYBER 
200 macro instruction. 


Kernel 4: Banded Linear System
     This loop is nested to level 2, but is neither
     tightly nested nor trivially collapsible. VAST leaves
     the loop alone and CYBER 200 FTN generates scalar
     code. KAP inverts the loops and discovers a vector
     DOT PRODUCT for which it needs to gather array "A".
     KAP leaves the loop on "J" as a scalar loop and
     finishes by gathering array "X", doing the final
     vector multiply, and scattering array "X".


Kernel 5: Tri-Diagonal Elimination, Below Diagonal
     This loop has a scalar recurrence on array "X", and
     therefore has been unrolled in the original code for
     improved scalar performance. VAST leaves the loop
     alone and CYBER 200 FTN schedules efficient scalar
     code. KAP rerolls the loop, recognizes the first
     order recurrence and generates a macro instruction,
     for which CYBER 200 FTN generates a STACKLIB call.
     KAP also factors a vector multiplication out of the
     loop.


Kernel 6: Tri-Diagonal Elimination, Above Diagonal
     Aside from the backward loop counter, this kernel is
     nearly identical to kernel 5. VAST leaves the loop
     alone and CYBER 200 FTN schedules efficient scalar
     code. KAP rerolls the loop, recognizes the first
     order recurrence, and generates a macro instruction
     (again, a CYBER 200 STACKLIB call). The
     multiplication that could be vectorized and factored
     out of the loop is instead used in the macro
     instruction.


Kernel 7: Equation of State Excerpt
     This loop is vectorized by all three vectorizers.
     Eight broadcast triads for the seventeen operands
     make this a very impressive loop for the CYBER 205.


Kernel 8: P.D.E. Integration
     This is a good example of a loop with possible linear
     recurrences and many calculations that are not
     coupled to the recurrences. These latter calculations
     can be factored into their own loop and thereby
     transformed into vector instructions. Neither VAST
     nor CYBER 200 FTN does this and instead scalar code
     is generated. Because the loop is computationally
     dense with temporaries, effective scheduling of the
     256 word register file permits very fast scalar
     execution. KAP factors many, though not all,
     allowable calculations out of the loop. This leaves
     about half of the calculations in the scalar loop.


Kernel 9: Integrate Predictors
     This loop steps along the rows of a two dimensional
     array, instead of the columnwise ordering to which
     CYBER 200 FORTRAN defaults. VAST and CYBER 200 FTN
     generate gather instructions, the vector computation,
     and then scatter instructions. KAP explicitly notes
     the parallel structure of the computation, implies
     the ROWWISE construct, but cannot legally implement
     it on the CYBER 200 without unpredictable side
     effects. Here I make the choice of using ROWWISE in
     the original code because I can rule out the side
     effects. Then all three vectorizers discover the same
     solution as a simple vector expression.


Kernel 10: Difference Predictors
     This kernel has the same rowwise characteristics as
     kernel 9. It is vectorized by all three vectorizers
     in the same way when the ROWWISE statement is used.
     Again, KAP implies the ROWWISE usage whereas VAST and
     CYBER 200 FTN do not.


Kernel 11: First Sum
     This loop is recognized by both CYBER 200 FTN and KAP
     as a linear recurrence, and the appropriate
     instruction macro (STACKLIB call) is generated. VAST
     leaves this code alone.
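
A first sum is a first order linear recurrence: each element depends on the one just computed, so it cannot be turned into an element-wise vector operation, but it can be handed to a single scan primitive. The Python sketch below shows the two views; the function names are hypothetical and the standard library `itertools.accumulate` stands in for the macro instruction.

```python
from itertools import accumulate


def first_sum_loop(y):
    """Scalar form: x[k] = x[k-1] + y[k], a loop-carried dependence."""
    x = [y[0]]
    for k in range(1, len(y)):
        x.append(x[k - 1] + y[k])
    return x


def first_sum_scan(y):
    """The same recurrence expressed as one running-sum (scan) operation."""
    return list(accumulate(y))


y = [3, 1, 4, 1, 5, 9]
print(first_sum_loop(y))
print(first_sum_scan(y))
```

Recognizing the loop as a scan is what lets a vectorizer replace the whole loop by one library call.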


Kernel 12: First Difference 
All three vectorizers recognize this loop 
as vector subtraction. 


Kernel 13: 2-D Particle Pusher
     This kernel contains array references in which the
     array elements are loaded and stored in data
     dependent mappings, that is, specifically not by
     fixed strides. VAST does not change the code. CYBER
     200 FTN generates scalar code. KAP gathers and
     scatters by index lists for all data dependent
     mappings except array "H", which is left in a scalar
     loop. Thus, most of the arithmetic is done in vector
     mode. Note that array "H" can be "gathered/scattered"
     and calculated in vector mode only if it is a
     permutation list, that is, a list with no duplicate
     index references. Such a case is not guaranteed here.
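
The permutation-list condition can be made concrete with a small Python sketch. The function names and data are hypothetical; the point is only that a gather, vector add, scatter sequence matches the scalar loop when the index list has no duplicates and silently loses updates when it does.

```python
def scalar_update(h, idx, inc):
    """Reference scalar loop: h[idx[p]] += inc[p], one element at a time."""
    h = list(h)
    for i, d in zip(idx, inc):
        h[i] += d
    return h


def gather_add_scatter(h, idx, inc):
    """Vector-style version: gather, add as whole vectors, then scatter."""
    gathered = [h[i] for i in idx]                     # gather by index list
    updated = [g + d for g, d in zip(gathered, inc)]   # vector add
    out = list(h)
    for i, u in zip(idx, updated):                     # scatter; later writes win
        out[i] = u
    return out


h = [0, 0, 0]
inc = [1, 1, 1]
print(scalar_update(h, [0, 1, 2], inc), gather_add_scatter(h, [0, 1, 2], inc))
print(scalar_update(h, [0, 0, 2], inc), gather_add_scatter(h, [0, 0, 2], inc))
```

With the duplicate index 0 the scalar loop accumulates both increments, while the gathered version reads the old value twice and keeps only one increment.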


Kernel 14: 1-D Particle Pusher
     This loop is analogous to kernel 13 in having data
     dependent referencing of arrays. Again CYBER 200 FTN
     and VAST perform no vectorization. KAP vectorizes a
     smaller percentage of this kernel than of kernel 13.
     The speed reduction of the vector version relative to
     the scalar version suggests that the register file
     cannot be scheduled as effectively as when the entire
     loop is left as scalar.


Kernel 15: Casual FORTRAN
     VAST and CYBER 200 FTN find no vectorization in this
     kernel. KAP replaces the IF statements with logical
     tests to generate control bit vectors. The two
     dimensional arithmetic expressions in the kernel are
     converted to two dimensional vector expressions with
     control stores. Figures 1 and 2 show the original
     FORTRAN and vector FORTRAN versions of the KAP output
     respectively. Note the discussion in Section 4 below.


C     KERNEL 15  CASUAL FORTRAN.  DEVELOPMENT VERSION.
C     CASUAL ORDERING OF SCALAR OPERATIONS IS TYPICAL PRACTICE.
C     THIS EXAMPLE DEMONSTRATES THE NON-TRIVIAL TRANSFORMATION
C     REQUIRED TO MAP INTO AN EFFICIENT MACHINE IMPLEMENTATION.
      SAVE=RTC(SAVDMY)
      DO 99915 IGLM=1,ITIMES
      CALL CLRSTK
      NR= 7
      NZ= 25
      AR= 5./3.
      BR= 7./5.
      DO 45 J= 2,NR
      DO 45 K= 2,NZ
      IF( J-NR)31,30,30
   30 VY(K,J)= 0.0
      GO TO 45
   31 IF( VH(K,J+1) -VH(K,J))33,33,32
   32 T= AR
      GO TO 34
   33 T= BR
   34 IF( VF(K,J) -VF(K-1,J))35,36,36
   35 R= AMAX1( VH(K-1,J), VH(K-1,J+1))
      S= VF(K-1,J)
      GO TO 37
   36 R= AMAX1( VH(K,J), VH(K,J+1))
      S= VF(K,J)
   37 VY(K,J)= SQRT( VG(K,J)**2 +R*R)*T/S
      IF( K-NZ)40,39,39
   39 VS(K,J)= 0.
      GO TO 45
   40 IF( VF(K,J) -VF(K,J-1))41,42,42
   41 R= AMAX1( VG(K,J-1), VG(K+1,J-1))
      S= VF(K,J-1)
      T= BR
      GO TO 43
   42 R= AMAX1( VG(K,J), VG(K+1,J))
      S= VF(K,J)
      T= AR
   43 VS(K,J)= SQRT( VH(K,J)**2 +R*R)*T/S
   45 CONTINUE
99915 CONTINUE
      SDT(15)=RTC(SAVDMY)

Figure 1. Kernel 15-Original Code.


      SAVE = RTC(SAVDMY)
      NR = 7
      NZ = 25
      AR = 5./3.
      BR = 7./5.
      DO IV = 1,ITIMES
      CALL CLRSTK
      JNX(1:6) = SEQ(1,6,1)+1
      FV1X(1:6) = JNX(1:6)-7
      MF1(1:6) = FV1X(1:6).GE.0
      MF2(1:6) = FV1X(1:6).LT.0
      KRX(1:24) = SEQ(1,24,1)+1
      WHERE (MF1Q1(1:24,1:6))
     X  VY(2:25,2:7) = 0.
      WHERE (MF2Q3(1:24,1:6))
     X  FV2XQ2(1:24,1:6) = VH(2:25,3:8)-VH(2:25,2:7)
      MF3Q4(1:24,1:6) = FV2XQ2(1:24,1:6).GT.0E0.AND.MF2Q3(1:24,1:6)
      MF4Q5(1:24,1:6) = FV2XQ2(1:24,1:6).LE.0E0.AND.MF2Q3(1:24,1:6)
      WHERE (MF3Q4(1:24,1:6))
     X  TBXQ6(1:24,1:6) = AR
      WHERE (MF4Q5(1:24,1:6))
     X  TBXQ6(1:24,1:6) = BR
      WHERE (MF2Q3(1:24,1:6))
     X  FV3XQ7(1:24,1:6) = VF(2:25,2:7)-VF(1:24,2:7)
      MF5Q8(1:24,1:6) = FV3XQ7(1:24,1:6).LT.0E0.AND.MF2Q3(1:24,1:6)
      MF6Q9(1:24,1:6) = FV3XQ7(1:24,1:6).GE.0E0.AND.MF2Q3(1:24,1:6)
      WHERE (MF5Q8(1:24,1:6)) DO
        RBXQ10(1:24,1:6) = AMAX1(VH(1:24,2:7),VH(1:24,3:8))
        SCXQ11(1:24,1:6) = VF(1:24,2:7)
      END WHERE
      WHERE (MF6Q9(1:24,1:6)) DO
        RBXQ10(1:24,1:6) = AMAX1(VH(2:25,2:7),VH(2:25,3:8))
        SCXQ11(1:24,1:6) = VF(2:25,2:7)
      END WHERE
      WHERE (MF2Q3(1:24,1:6)) DO
        VY(2:25,2:7) = SQRT(VG(2:25,2:7)**2+RBXQ10(1:24,1:6)*RBXQ10(1:24
     X  ,1:6))*TBXQ6(1:24,1:6)/SCXQ11(1:24,1:6)
        FV4Q12(1:24,1:6) = KRXQ13(1:24,1:6)-25
      END WHERE
   60 MF7Q14(1:24,1:6) = FV4Q12(1:24,1:6).GE.0.AND.MF2Q3(1:24,1:6)
      MF8Q15(1:24,1:6) = FV4Q12(1:24,1:6).LT.0.AND.MF2Q3(1:24,1:6)
      WHERE (MF7Q14(1:24,1:6))
     X  VS(2:25,2:7) = 0.
      WHERE (MF8Q15(1:24,1:6))
     X  FV5Q16(1:24,1:6) = VF(2:25,2:7)-VF(2:25,1:6)
      MF9Q17(1:24,1:6) = FV5Q16(1:24,1:6).LT.0E0.AND.MF8Q15(1:24,1:6)
      MF1Q18(1:24,1:6) = FV5Q16(1:24,1:6).GE.0E0.AND.MF8Q15(1:24,1:6)
      WHERE (MF9Q17(1:24,1:6)) DO
        RAXQ19(1:24,1:6) = AMAX1(VG(2:25,1:6),VG(3:26,1:6))
        SBXQ20(1:24,1:6) = VF(2:25,1:6)
        TAXQ21(1:24,1:6) = BR
      END WHERE
      WHERE (MF1Q18(1:24,1:6)) DO
        RAXQ19(1:24,1:6) = AMAX1(VG(2:25,2:7),VG(3:26,2:7))
        SBXQ20(1:24,1:6) = VF(2:25,2:7)
        TAXQ21(1:24,1:6) = AR
      END WHERE
   62 WHERE (MF8Q15(1:24,1:6))
     X  VS(2:25,2:7) = SQRT(VH(2:25,2:7)**2+RAXQ19(1:24,1:6)*RAXQ19
     X  (1:24,1:6))*TAXQ21(1:24,1:6)/SBXQ20(1:24,1:6)
   65 ENDDO
      SDT(15) = RTC(SAVDMY)

      ARRAY MF1Q1(I = 24,J = 10) = MF1(J)
      ARRAY FV2XQ2(I = 10,J = 10) = FV2X(J,I)
      ARRAY MF2Q3(I = 24,J = 10) = MF2(J)
      ARRAY MF3Q4(I = 10,J = 10) = MF3(J,I)
      ARRAY MF4Q5(I = 10,J = 10) = MF4(J,I)
      ARRAY TBXQ6(I = 10,J = 10) = TBX(J,I)
      ARRAY FV3XQ7(I = 10,J = 10) = FV3X(J,I)
      ARRAY MF5Q8(I = 10,J = 10) = MF5(J,I)
      ARRAY MF6Q9(I = 10,J = 10) = MF6(J,I)
      ARRAY RBXQ10(I = 10,J = 10) = RBX(J,I)
      ARRAY SCXQ11(I = 10,J = 10) = SCX(J,I)
      ARRAY FV4Q12(I = 10,J = 10) = FV4X(J,I)
      ARRAY KRXQ13(I = 10,J = 6) = KRX(I)
      ARRAY MF7Q14(I = 10,J = 10) = MF7(J,I)
      ARRAY MF8Q15(I = 10,J = 10) = MF8(J,I)
      ARRAY FV5Q16(I = 10,J = 10) = FV5X(J,I)
      ARRAY MF9Q17(I = 10,J = 10) = MF9(J,I)
      ARRAY MF1Q18(I = 10,J = 10) = MF10(J,I)
      ARRAY RAXQ19(I = 10,J = 10) = RAX(J,I)
      ARRAY SBXQ20(I = 10,J = 10) = SBX(J,I)
      ARRAY TAXQ21(I = 10,J = 10) = TAX(J,I)


Figure 2. Kernel 15-KAP Vector FORTRAN.


Kernel 16: MONTE CARLO Search Loop 
No vector manipulations are done by any 
of the three vectorizers. 


Kernel 17: Implicit Conditional Computation 
No vector manipulations are done by any 
of the three vectorizers. 


Kernel 18: 2D Hydrodynamics Fragment
     This kernel has three loops nested to level 2. It is
     a classic example of a "picture-in-frame"
     computation. The mesh points on the frame are
     boundary conditions to be skipped in this
     computation. Calculations are to take place for all
     the points in the picture. CYBER 200 FTN will not
     collapse this construct into 1-dimensional vectors
     because it will not generate the control stores to
     avoid the frame. CYBER 200 FTN instead vectorizes the
     inner loops, which in this case yield vectors of
     length 5. Both VAST and KAP generate a collapsed
     version of the two dimensional loops and the bit
     vectors to control the stores.
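
The "picture-in-frame" collapse can be sketched in Python. Everything here is illustrative: the function names, the 4 by 7 mesh (chosen so that the inner loop has length 5) and the doubling update are assumptions, not the Livermore kernel. The first version uses short inner loops over the picture; the second computes over the whole mesh as one long vector and uses a bit vector to suppress stores on the frame.

```python
def nested_update(grid, rows, cols):
    """Two nested loops that touch only the interior (the 'picture')."""
    out = list(grid)
    for j in range(1, rows - 1):
        for k in range(1, cols - 1):       # short inner vectors
            out[j * cols + k] = 2 * grid[j * cols + k]
    return out


def collapsed_update(grid, rows, cols):
    """One long vector operation plus a control bit vector for the stores."""
    n = rows * cols
    # Bit vector: True for interior points, False on the frame.
    mask = [0 < p // cols < rows - 1 and 0 < p % cols < cols - 1
            for p in range(n)]
    computed = [2 * v for v in grid]       # computed everywhere, frame included
    # Controlled store: keep the old value wherever the mask bit is off.
    return [c if keep else v for v, c, keep in zip(grid, computed, mask)]


rows, cols = 4, 7                          # interior rows have length 5
grid = list(range(rows * cols))
print(nested_update(grid, rows, cols) == collapsed_update(grid, rows, cols))
```

The collapsed form does some wasted arithmetic on the frame, but it works on one vector of 28 elements instead of two vectors of 5.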


4. Discussion 
State-of-the-Art Automatic Vectorization 


Even a casual glance at the descriptive summaries listed
above shows clearly that KAP did the most vectorization of
the three tools studied. KAP, in fact, vectorized a
superset of that done by VAST and CYBER 200 FTN combined.
A look at the techniques used by each should predict this
point. VAST and CYBER 200 FTN rely on pattern recognition
of source code within an individual DO loop to detect
constructs for vector conversion. KAP does most of its
pattern recognition within vector dependence graphs. Such
a representation traces the data dependence of individual
array elements and the flow dependence of sequences of
source code. Parallelism is not restricted to DO loops,
and DO loop analysis is more general. The resulting
representation is closer to the sense of the flow of the
calculations than the original source code in all but the
most carefully designed FORTRAN programs.


In the Livermore Benchmark, kernels 4, 8, 13, 14, 15, 16,
17, and 18 represent significant tests for
state-of-the-art vectorizers, circa 1981. The remaining
kernels are essentially "one-liners". Kernels 16 and 17
require knowledge of the input data before vectorization
transformation is reasonable.


Of the significant tests remaining, KAP solved
the problem optimally for all, save slight
improvements on kernels 8 and 15. This last
comment is made relative to the best manual
vectorization effort. Kernel 15 is a good
example of taking very stylized, though not
uncommon code, working out the data
dependencies, and generating an entirely
vectorized kernel from something that did not
initially look like vector code at all. I
have extracted the two forms of KAP output,
Figures 1 and 2, so the reader can see how the
transformation is done.


From discussions I have had with compiler
writers, vectorization experts, supercomputer
users, and the CYBER 200 Compiler Development
group, there is a general consensus that KAP
represents the state-of-the-art in automatic
vectorization technique at this time.


Usability 


Both the CYBER 200 FORTRAN Compiler and VAST
are eminently usable. Simply add one command
line for either into an execution procedure
and the rest is automatic. VAST allows the
user to add some hints in the form of
"directives" which aid the tool in its
analysis. The VAST package also has easy to
read output which can help the user in several
ways. Figure 3 shows the kernel 5 excerpt
from the VAST output. It describes clearly
what problems it had converting the loop.
Often the user can then recognize quickly what
could be improved and what should be left
alone. VAST was designed to be used in this
type of interaction, with the user and tool
sharing their expertise in an iterative mode
to optimize code. The second part of this
kernel, which is outside of the timing loop,
shows the user the appropriate CYBER 200
FORTRAN vector syntax for converted code.
This is a good learning device for the novice
vector programmer. VAST shows that a
vectorizing preprocessor can be incorporated
invisibly, and that the output can be an aid
to the experienced or inexperienced programmer.


The effort required to do a manual conversion 
of scalar code to vector code is estimated 
conservatively by the time I used to convert 
KAP output to CYBER 200 Vector FORTRAN. VAST 
demonstrates that the speed up in terms of 
programmer effort for automatically generated 
code is, in this case, at least a ten to one 
ratio over the manual effort. Thus, where KAP 
shows the vectorization state of the art, VAST 
shows how much programmer time can be saved in 
the automatic conversion of code. 


[VAST listing for kernel 5, headed "TRI-DIAGONAL ELIMINATION,
BELOW DIAGONAL". Messages shown in the listing:
"TRANSLATION DIAGNOSTIC -- LINE 372: ONLY ASSIGNMENT STATEMENTS
CAN BE TRANSLATED TO ARRAY SYNTAX" and "DO LOOP SUCCESSFULLY
TRANSLATED TO ARRAY SYNTAX", loop label 3051.]


Figure 3. VAST Output for Kernel 5.


Performance Considerations


Throughout this paper the motivation for
vectorization is to harness the potential
performance improvements of the vector and
array processors. Does automatic
vectorization in fact do this? The answer is
not a simple yes or no. Figure 4 shows the
performance profiles for the eighteen kernels
with varying trip counts. The profiles
represent for each kernel the fastest vector
version of that code as chosen from the three
vector solutions. Kernels with flat profiles
(kernels 5, 6, 8, 11, 13, 14, 16, and 17) are
dominated by scalar computation. Except for
kernel 8 their characteristic computation rate
is less than 10 MFLOPS. The remaining curves
have profiles which have increasing
performance with increasing trip count and an
asymptotic behavior for large trip counts.
This is the characteristic shape of vector
code executing on the CYBER 205. The
computation rate of these kernels typically
exceeds 50 MFLOPS, and in all cases exceeds 10
MFLOPS.


Vectorization did not produce the fastest
computation rates for all the kernels. The
fastest solution on the CYBER 205 for kernels
5, 8, and 14 was by CYBER 200 FTN which did
not detect any vectors for these kernels. The
timings in Table 3 for the CYBER 203 show
additional dramatic examples of vectorization
slowing code down. These occurrences depend
both on the source code and on a machine's
relative vector versus scalar performance
capability. For example, when vectorization
adds a substantial vector set-up penalty it
can negate the advantage of vector
computation. In such cases the compiler
should generate scalar object code. This
process is called vector optimization and is
machine dependent and sometimes data
dependent. It should not be confused with
vectorization. See Table 1 for the ranges in
performance due to the three vectorizer
packages.


[Figure 4. Performance profiles for the eighteen kernels
versus trip count (Automatic Vectorization).]


The goal of vector optimization is to generate
the best object code for a given vector
computation. Using the CYBER 205 as an
example, vector execution speed depends
strongly on
1. vector length,
2. vector store density for bit vector
   control stores,
3. vector set-up due to
   a. gather/scatter
   b. compress/mask-merge
4. generation of index and control bit
   vectors.


The optimized object code must reflect the 
expected execution time which as shown above 
may depend on many parameters. Such vector 
optimization must be addressed directly in 
future vectorizers to guarantee good 
performance. 
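

The decision described above can be sketched with a very simple
timing model: scalar time grows linearly with trip count, while
vector time is a fixed set-up cost plus a smaller per-element cost.
The model, the function names, and the cost figures used in the
usage lines are assumptions for illustration; they are not CYBER 205
measurements, and a real optimizer would also have to account for
store density, gather/scatter, and control vector generation.

```python
def scalar_time(n, t_scalar):
    """Modelled scalar time for n loop trips."""
    return n * t_scalar


def vector_time(n, startup, t_vector):
    """Modelled vector time: fixed set-up plus per-element cost."""
    return startup + n * t_vector


def choose_code(n, startup, t_vector, t_scalar):
    """Pick vector code only when the model says it is faster."""
    if vector_time(n, startup, t_vector) < scalar_time(n, t_scalar):
        return "vector"
    return "scalar"


def break_even_length(startup, t_vector, t_scalar):
    """Smallest trip count at which vector code wins, or None if never."""
    if t_vector >= t_scalar:
        return None
    return int(startup // (t_scalar - t_vector)) + 1


# Hypothetical costs in clock cycles: set-up 50, 1 per vector element,
# 5 per scalar element.
n_min = break_even_length(50, 1, 5)
short_loop = choose_code(12, 50, 1, 5)
long_loop = choose_code(13, 50, 1, 5)
```

With these made-up costs a loop of 12 trips is better left scalar,
which is the situation the text describes for short vectors.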


Figure 5 shows the comparison for best vector
speed (due to automatic vectorization) versus
the best scalar speed. Because of the
maturity of scalar optimization relative to
automatic vectorization, the scalar times are
near the best. The vector ranges on the other
hand represent for each kernel the fastest
vector version for that code as in Figure 4.
These ranges are not guaranteed to be the best
vector speed because the decision on how much
to vectorize a given loop is not based on
performance considerations. Figure 5 shows
impressive performance improvements due to
vectorization of very different types of codes.


Lastly, the following questions should be
asked. Is not the effective speed of the
computer determined by the code in which the
computer spends the most time, which is often
the slow scalar code? Has vectorization
improved anything if scalar code remains and
dominates? The answer is a definite yes. The
code may still be scalar dominant, and yes, we
have also improved things. If a computer has
a very fast vector unit, relative to its
scalar unit, then the percentage improvement
is equal to the percentage of scalar
computations converted to vector computations,
whether by a compiler, or human, or
preprocessor. If the scalar code can be
reduced so that it no longer dominates the
computation, then the average computation rate
represents the impressive vector rate. Though
this latter goal may be unlikely for a
computer center workload as a whole, many
users will enjoy the benefit. So the more an
automatic tool does for you, the more
speed-up you get. In the worst case, the
improvement is linear with the vectorization
success.

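
The relation stated above between the fraction of work vectorized
and the improvement can be written out as a short calculation. This
is the usual Amdahl-style model, supplied here as an illustration;
the function names and the sample vector-to-scalar speed ratios are
assumptions, not figures from the benchmark.

```python
def speedup(fraction_vectorized, vector_to_scalar_ratio):
    """Overall speedup when a fraction of the scalar run time is vectorized.

    vector_to_scalar_ratio is how many times faster the vector unit
    executes the converted work than the scalar unit did.
    """
    remaining = ((1.0 - fraction_vectorized)
                 + fraction_vectorized / vector_to_scalar_ratio)
    return 1.0 / remaining


def time_saved(fraction_vectorized, vector_to_scalar_ratio):
    """Fraction of the original run time that disappears."""
    return 1.0 - 1.0 / speedup(fraction_vectorized, vector_to_scalar_ratio)


# With a very fast vector unit the time saved approaches the fraction
# of work that was vectorized.
saved_fast_unit = time_saved(0.5, 1.0e9)
saved_modest_unit = time_saved(0.5, 4.0)
```

For half the work vectorized, a very fast vector unit removes about
half the run time (a speedup of about 2), while a vector unit only
four times faster than scalar removes 37.5 percent.
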
5. Summary: A Case for Automatic 
Vectorization 


I have studied three automatic vectorization
packages and the performance enhancement each
brings to the Livermore Loops Benchmark.
Performance speed-ups for several kernels are
very impressive, essentially achieving the
theoretical speed-up. Out of the 18 kernels,
automatic vectorization improved the CYBER 205
times by the following indicated factors over
scalar performance:


3 kernels (1, 3, 12):                 > 10 * scalar
7 kernels (2, 4, 7, 9, 11, 15, 18):   > 5 * scalar
1 kernel  (10):                       > 3 * scalar
2 kernels (6, 13):                    > 1.1 * scalar


Scalar and Vector Performance Summary for the Livermore Kernels


If these 18 kernels are used as a system
workload, each kernel equally weighted,
automatic vectorization would increase the
total throughput by 70 percent. Note that the
scalar time still dominates the execution time
(60 percent scalar, 40 percent vector).
Kernel 16 dominates this time (59 percent of
total) because it is so much slower than the
other 17 kernels. If this kernel were
eliminated from the sample, automatic
vectorization would increase the total
throughput by 170 percent, or a factor of 2.7
times in speed. In this case the scalar time
is less than half the total (37 percent
scalar, 63 percent vector; comparable to 11 of
17 kernels vectorized). Thus, as an
approximate figure, automatic vectorization
will give about 50 percent vectorization by
computations count, a speedup of 100 percent
in time, for code like the Livermore
Benchmark. Note that in the best case when a
code is full of loops analogous to kernels 1,
2, 3, 4, 7, 9, 10, 12, 15, and 18, automatic
vectorization will bring a manyfold
improvement over the scalar execution.
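

The two throughput figures quoted above (70 percent with kernel 16,
170 percent without it) can be checked against each other with a
little arithmetic. The sketch below assumes that kernel 16 takes the
same time before and after vectorization and that its 59 percent
share refers to the vectorized workload; both are readings of the
text, not statements from it. The function name is hypothetical.

```python
def speedup_without(total_speedup, fixed_share_after):
    """Speedup of the remaining work when one unchanged kernel is removed.

    Times are normalized so the vectorized workload takes 1.0 and the
    scalar workload takes `total_speedup`. The removed kernel is assumed
    to cost `fixed_share_after` in both runs.
    """
    scalar_rest = total_speedup - fixed_share_after
    vector_rest = 1.0 - fixed_share_after
    return scalar_rest / vector_rest


# 70 percent more throughput overall; kernel 16 is 59 percent of the total.
factor = speedup_without(1.70, 0.59)
```

Under those assumptions the factor comes out at about 2.7, in
agreement with the 170 percent figure.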


KAP in all cases did equal or more
vectorization than CYBER 200 FTN or VAST,
and is generally agreed to represent the state
of the art in analysis technique. Of the 170
percent speed-up quoted above, 66 percent can
be attributed to CYBER 200 FTN vectorization,
an additional 11 percent to VAST
vectorization, and an additional 93 percent to
KAP. This assumes that as information
pertaining to the vectorizability of a kernel
is increased the kernel will at worst run at
the same speed and may run faster; that is,
the best vector solution is deterministic.
This point has yet to be proven.


The amount of programmer time and effort that
an easy to use vectorizer can save over an
equivalent manual solution is demonstrated by
comparing the time to use VAST as opposed to
KAP. Translating manually from KAP vector
output to CYBER 200 FORTRAN took ten to
fifteen times longer than the 2 1/2 days to
get a best effort from VAST. The time period
for the KAP work was not an attribute of KAP,
but of the manual process for translating and
debugging code. VAST shows that there is
nothing particularly difficult in generating
CYBER 200 FORTRAN automatically. There is a
lesson to be learned when I spend 6 weeks on
something that need only take 2 1/2 days.


Thus, as a programmer productivity aid,
automatic vectorization has high potential.
Presently a tool does not exist that has the
vectorization potential of KAP and the
usability of VAST.


The most salient criticism of automatic
vectorization is that it does not address the
total problem as stated by those users who
need the best conceivable vector solution.
That is, it does not rewrite code on a global
scale, nor reorganize the mathematical
approach. In simple terms, those who need the
best solution find they need to rework the
problem from scratch. There are a lot of
human factors involved in such an effort. The
required imagination and thinking cannot be
replaced. Interactive vectorization tools
could help program development. Such tools
would have to be able to respond to queries on
the vectorizability of a kernel and the
resulting side effects to other routines. They
should be able to predict both the static and
dynamic memory requirements of program
changes. Graphical response showing how the
calculations proceed through the grid space
would be helpful to the user who interacts
well with chalk board tactics.


Past experience shows that such tools will
encourage the user's imagination and insight,
and thus truly help the code development
effort. There is nothing in this scenario
that appears to be beyond the technical
ability of near future vectorizers. As vector
and array processors become more pervasive,
the demand for such software will rapidly
increase. Therefore, I see a promising future
for automatic vectorization software and the
users who must rely on it.


RESULTS OF PARALLEL PROCESSING A LARGE SCIENTIFIC PROBLEM
ON A COMMERCIALLY AVAILABLE MULTIPLE-PROCESSOR COMPUTER SYSTEM


Robert Hiromoto 
Computing Division 
Los Alamos National Laboratory 
Los Alamos, New Mexico 87545 


Abstract 


Presented is a summary of a parallel-processing
experiment designed to study the feasibility
of doing large-scale scientific calculations
on multiple-processor architectures. This
particular experiment was performed on a UNIVAC
1100/80 computer system, whose architecture
(configured about a common memory) eliminates the
need for data transmission between processors. The
algorithm used in the experiment is a
particle-in-cell (PIC) method; it was selected
because of its large, independent computational
tasks that are adaptable to this particular
parallel-processing architecture. Timing results
for the parallel-processing version of this
algorithm using one, two, three, and four
identical processors are given and are shown to
have promising speedup times when compared to the
overall run times measured for a single processor
version of the algorithm.


Summary 


This paper presents the results of an
investigation concerning the feasibility of
parallel processing a significant scientific
problem on a commercially available
multiple-processor system. Of particular interest
is the computational speedup as a function of the
number of processors employed. The algorithm used
in this experiment is a particle-in-cell (PIC)
method for simulating the electrostatic
interactions of a collisionless plasma [1]. This
particular problem represents a large scientific
calculation of interest to the Los Alamos National
Laboratory as well as a class of algorithms
exhibiting limited vector capabilities. Figure 1
illustrates the multiple/single-thread PIC
algorithm implemented in our experiment with
threads 1 and 4 parallel processed.


An initialization stage of the algorithm sets
up the aggregate of plasma particles' positions,
velocities, and corresponding charge distribution.
For each discretized time step (Δt), the main
computational loop advances each particle's
position and velocity through the effects of the
electrostatic interactions arising between
particles and a uniform, background
electric/magnetic field. Throughout this loop, a
particle-in-cell method is employed to decompose a
region of space into a collection of cells [2].
These cells are then used to track particle
movement, and assist in the evaluation of the
total charge distribution (C), the electrostatic
potential (φ), and the electric field (E) under
which all particles are accelerated (pushed).


Our experiment was successfully implemented
and timed on a UNIVAC 1100/80 multiple-processor
computer system at Sperry UNIVAC, Roseville,
Minnesota. A simplified diagram of the UNIVAC
1100/80 is given in Fig. 2. A principal feature of
this system is the ability of all processors to
execute a single instruction stream in parallel
upon data in common memory. This feature is
supported by the Cobol compiler but not by the
Fortran compiler. Software designed by David
Hammer (a UNIVAC consultant with Sandia National
Laboratories, Albuquerque, New Mexico) enabled a
single copy of the PIC code written in Fortran to
be implemented in a parallel-processing mode. The
multiple-processor computer system may be
configured with one, two, three, or four
processors. A further characteristic of the UNIVAC
1100/80 architecture is that no one physical
processor may always have exclusive access to the
execution of a given activity. On the contrary,
depending upon the length of the task itself, all
the physical processors may have time-shared
portions of the activity's entire execution
stream. A distinction, therefore, is made between
activities and processors.


Because of software addressing limitations,
the PIC code was restricted to a maximum of
262,000 decimal words of total memory. For each
particle, five data quantities (two for position
and three for velocity) were required. Three mesh
quantities, constituting a 34 X 34 mesh size, were
required and duplicated for a maximum of eight
particle-push activities. A total of 37,000
particles were initiated for processing, requiring
213,000 words of memory (particle plus mesh data).
An additional 47,000 words of memory were used for
the instruction stream, address mapping and
activity synchronization scheme.
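

The memory figures above can be reproduced with a few lines of
arithmetic. The sketch assumes one word per stored quantity, which
the text implies but does not state; the variable names are chosen
here for illustration.

```python
WORD_LIMIT = 262_000           # addressing limit quoted in the text

particles = 37_000
words_per_particle = 5         # two position and three velocity components
mesh_points = 34 * 34
mesh_quantities = 3
push_activities = 8            # mesh data duplicated once per activity
overhead_words = 47_000        # instruction stream, mapping, synchronization

particle_words = particles * words_per_particle
mesh_words = mesh_points * mesh_quantities * push_activities
data_words = particle_words + mesh_words
total_words = data_words + overhead_words
fits = total_words <= WORD_LIMIT
```

The particle and mesh data come to 212,744 words, which rounds to the
213,000 quoted, and the total of 259,744 words stays just under the
262,000-word limit.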


Table I gives the results of the experiment.
The speedup values are the ratios of the overall
execution time of a single-thread version of PIC
running on one processor to the overall execution
time of a multithread PIC code running on two,
three, and four processors. We found that a
maximum speedup of three was attained when using
four processors with four activities spawned for
each multiple task.


Because the multithread PIC was not totally
parallel (see Figure 1), the speedup for four
processors may not indicate the full potential of
the PIC algorithm. The times recorded and used for
the parallel-processing speedup calculations were
based on wall clock times, with timing runs made
in a dedicated mode. Because of time constraints
and limited resources, actual CPU times were not
measured.
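

The speedup definition used for Table I, and what a speedup of three
on four processors implies, can be sketched as below. The wall clock
times in the usage lines are hypothetical, since Table I is not
reproduced here. The serial-fraction estimate uses the Karp-Flatt
metric, a later technique brought in here as an aid to reading the
result; it is not part of the original experiment.

```python
def speedup(t_single, t_parallel):
    """Ratio of single-thread wall clock time to multithread wall clock time."""
    return t_single / t_parallel


def efficiency(s, processors):
    """Speedup obtained per processor."""
    return s / processors


def serial_fraction(s, processors):
    """Karp-Flatt estimate of the fraction of work that did not run in parallel."""
    return (1.0 / s - 1.0 / processors) / (1.0 - 1.0 / processors)


# Hypothetical timings that give the reported maximum speedup of three.
s4 = speedup(120.0, 40.0)
e4 = efficiency(s4, 4)
f4 = serial_fraction(s4, 4)
```

A speedup of three on four processors corresponds to 75 percent
efficiency and an estimated serial fraction of about one ninth,
consistent with the remark that the multithread PIC was not totally
parallel.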


We conclude that significant computational
speedups are strongly suggested by our results for
multiple-processor environments similar to the
UNIVAC 1100/80 computer system. We further note
and caution that our results are highly coupled to
the particular algorithm and the multiple-processor
architecture selected.


Figure 1. A multithread version of PIC as implemented
          on a UNIVAC System 1100/80 with
          two parallel-processing tasks (1 and
          4), where A = total number of parallel
          activities (multithread), n = total
          number of particles, n_i = number of
          particles for activity i, C = total
          charge (distribution), and C_i = charge
          computed for activity i.


Abstract


Kernel-control tailoring is a method of
preprocessing an ordinary sequential program for
parallel execution. The preprocessing is intended
to remove a substantial amount of the control
dependence between operations in the program
through deletion of conditional branches and
unrolling of loops. The method is applicable to
existing programs of practical size. Preliminary
results from tailoring of a sample of FORTRAN
programs are reported.


Introduction 


There are three classes of dependence between
operations that inhibit parallel execution: data
dependence, in which one operation must wait for
an operand to be computed as the result of another
earlier operation; resource dependence, in which
operations must wait on the availability of
memory, functional units, or other resources; and
control dependence, in which an operation must
wait until it is known to be on the actual path of
program execution before being executed. Various
studies have indicated that the effect of control
dependence in inhibiting parallel execution is a
central problem. Riseman and Foster [1] isolated
the problem in a classic study. For a sample of
real programs run on an idealized machine (no
resource dependence) they showed that programs
with an average potential speedup of 51 to 1
(parallel over sequential execution), when only
data dependence is considered, in fact could
realize only an average speedup of less than 2 to
1, due to the effect of control dependence. The
goal of this study is to obtain more extensive
data on the magnitude of this control dependence
in large scientific FORTRAN programs, and then to
investigate a novel solution to the removal of
some of this control dependence. The new method
is called kernel-control tailoring and is based on
the use of kernel-control decomposition (Pratt
[2]) to preprocess the program to determine the
control path, leaving a simplified program with
less control dependence to be executed on the
parallel computer.


Control Dependence in FORTRAN Programs


The initial question of this study was to
obtain better data on the amount of control
dependence in a realistic sample of FORTRAN
programs. The usual notion of basic block is
useful. A basic block is a sequence of straight
line code that includes no conditional branching,
looping or subprogram calls. Within a basic
block, execution of operations can be performed in
parallel, inhibited only by data dependencies.
Within a basic block, every operation essentially
has the same control signal input, which is the
control signal sent when the branch point
preceding entry to the block is passed during
execution. Large basic blocks in a program
indicate relatively little control dependence
(since all operations in a basic block are
dependent on the same control signal); small basic
blocks indicate much control dependence.


A sample of 44 production FORTRAN programs
was analyzed, ranging in size from 2500 to 125,000
lines of code, all written for applications in
nuclear and structural engineering. About 920,000
lines of code were analyzed; after deletion of
comments, declarations, etc., about 390,000
FORTRAN statements remained to form the basis for
the study. The size of the basic blocks was used
as an appropriate indicator of the amount of
control dependence. Each program was scanned and
basic block sizes were tabulated (among a wealth
of other statistics). The average basic block
size was found to be 3.5 statements or 23
operations (using some rough approximations for
operation counts). These figures substantiate
that control dependence is indeed a serious
problem, since not much useful parallelism can be
expected within a block of only 23 operations. In
essence, these results support what is clear from
a visual analysis of typical sequential programs
in almost any language: control structure, that is
branching, looping, and subprogram calls, almost
always breaks up a program into small basic
blocks. This fragmentation makes most operations
dependent on a control signal computed only
slightly before the operation itself is to fire,
regardless of the fact that its data operands may
have been computed much earlier.
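

The kind of tabulation described above can be sketched in a few
lines. This is a rough illustration, not the scanner used in the
study: statements are plain strings, a statement is treated as
control if it begins with one of a few keywords, control statements
separate blocks and are not counted, and branch targets (statement
labels) are ignored. The keyword list and function names are
assumptions.

```python
CONTROL_PREFIXES = ("IF (", "IF(", "GOTO", "GO TO", "DO ", "CALL ",
                    "RETURN", "STOP")


def is_control(statement):
    """Crude test for branching, looping, or subprogram-call statements."""
    return statement.strip().upper().startswith(CONTROL_PREFIXES)


def basic_block_sizes(statements):
    """Sizes of the runs of straight-line statements between control statements."""
    sizes, current = [], 0
    for statement in statements:
        if is_control(statement):
            if current:
                sizes.append(current)
            current = 0
        else:
            current += 1
    if current:
        sizes.append(current)
    return sizes


program = [
    "A = B + C",
    "D = A * 2.0",
    "IF (D .GT. 1.0) GOTO 10",
    "E = D - 1.0",
    "CALL SUB(E)",
    "F = E",
    "G = F",
    "H = G",
]
sizes = basic_block_sizes(program)
average = sum(sizes) / len(sizes)
```

For this made-up fragment the blocks have 2, 1, and 3 statements, an
average of 2.0, the same order of magnitude as the 3.5 reported for
the production sample.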


Various solutions have been proposed in the
literature to remove control dependence in
programs: prefetching and executing both arms of
conditionals (see Riseman and Foster [1] and
Magid, Tjaden, and Messinger [4]), the use of
boolean variables to turn control dependence into
data dependence (see Padua, et al. [3]), and
various loop transformations (Padua [3]).
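

The second of these ideas, turning a control dependence into a data
dependence with a boolean variable, can be illustrated by a small
example. The functions and the arithmetic in them are invented for
illustration and are not taken from the cited papers.

```python
def with_branch(x):
    """Original form: the result depends on which path is taken."""
    if x > 0.0:
        return x * 2.0
    return x - 1.0


def if_converted(x):
    """Branch-free form: both arms are computed and a boolean selects one."""
    take = x > 0.0                 # the control decision, now held as data
    then_value = x * 2.0
    else_value = x - 1.0
    return take * then_value + (not take) * else_value


samples = [-2.0, 0.0, 3.5]
same = all(with_branch(x) == if_converted(x) for x in samples)
```

In the second form every operation can be issued without waiting for
the comparison; only the final selection depends on it.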


Kernel-control decomposition 


An alternative approach to removing control
dependence is suggested by the theory of
kernel-control decomposition (Pratt [2]). We first
sketch the main theoretical results, then indicate
why the theoretical potential is unrealizable in
practice, and show an alternative approach that
appears to bypass some of the practical
difficulties.


Briefly summarized, the theory of
kernel-control decomposition shows that any
program may be decomposed into a control part,
which is concerned only with determining the
control path to be taken by the program, and a
kernel part, which is concerned only with
computing the output results of the program. The
surprising result is that, in principle, control
dependence can be completely eliminated from the
actual computation of the program (the kernel) by
performing this decomposition and then executing
the control part separately first to determine the
control path.


Practical Constraints on K-C Decomposition 


The theoretical result of complete removal of
control dependence is tempered by severe practical
difficulties. The first problem is that in real
programs, very few variables and statements are
pure kernel or pure control; most participate in
both kernel and control computations. In the
decomposition, these variables and statements must
be copied into both kernel and control parts. The
second major difficulty lies in the identification
of the kernel and control components of a
program. To identify each component involves a
process of backtracing control paths from output
statements (to identify the kernel part) and from
conditional branch points (to identify the control
part). This is a nontrivial task in large
practical programs. The result of studying the
decomposition potential of several medium sized
FORTRAN programs is that complete decomposition is
impractical, primarily for these two reasons.


A Practical Approach 


Analysis of the sample of FORTRAN programs
suggests that an alternative to complete
kernel-control decomposition might be practical. We
observe that in these practical programs, there
is in fact a subset of the variables that are
global control variables whose values are set
directly from input data, and that thereafter
remain constant during execution. These variables
represent the problem size parameters, output
option choices, and various other important
parameters during a run of the program. Actually
these are not pure control variables in the
theoretical sense, because inevitably their values
are printed out during the course of the run, but
ignoring this output (which serves only for
documentation of the run) they serve only control
purposes within the program.


In a typical FORTRAN program, these global
control variables are part of a COMMON block. At
the start of the program, their values are
initialized from data provided by the user in
setting the problem size parameters and options to
be used. Subsequently these control values are
tested repeatedly during program execution to
control branching and also to control the number
of iterations of loops, but their values do not
enter directly into the computation of output
results except in minor ways. Another important
practical observation is that for many large
production programs, such as nuclear reactor
simulations, these global control variables are
used to set problem size parameters and options
that remain constant over many runs of the same
program, e.g., for many runs of a large
simulation.


We use the concepts from kernel-control
decomposition to decompose the program, but only
for a partial decomposition based solely on these
global control variables. The method is termed
kernel-control tailoring (or K-C tailoring) and
may be outlined as follows:


1. Identify the global control variables of 
interest. This can be done automatically by 
scanning the source program, identifying 
"candidate" variables in COMMON blocks, and then 
deleting those variables that are assigned 
modified values in any subroutine, to get the 
final list of variables (making due allowance for 
possible coding tricks in FORTRAN that mask such 
assignments).


2. Identify the input values for these
control variables (extract them from the input
data).


3. Preprocess each routine in the program to 
find the conditional branching and looping that is 
controlled by these control variables only. 


4. For each conditional branch found,
determine the direction of the branch, using the
known values for the control variables. Since the
control values are invariant during execution, the
direction of such a branch is always the same
during execution. Thus the code down the branch
not taken is dead code and can be deleted. Also
delete the conditional branch statement itself
(and replace it by a GOTO to the proper branch if
necessary).


5. For each loop found, determine the number 
of iterations of the loop, using the known values 
for the control variables. The loop may now be 
unrolled completely to become straight line code 
with no testing for termination, or it may be 
partially unrolled in various ways. Nested loops 
can be transformed to unroll the loop with the 
smallest iteration count, etc. 


6. Take the preprocessed program, which now 
has a substantial part of the control sequencing 
removed (ideally), and which also has a 
substantial amount of dead code eliminated 
(ideally), and process it for parallel processing 
in the usual way. 


7. The preprocessed program may now be
executed repeatedly for different data sets,
provided the control variable input values remain
unchanged. Thus for a large production program,
the cost of the preprocessing is potentially
spread over more than a single execution of the
program.
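

Steps 4 and 5 above can be sketched on a toy program representation.
Everything in the sketch is an assumption made for illustration: the
tuple form of the program, the name tailor, the variable names
IPRINT, NZ and CONVERGED, and the simplification that an "if" tests a
single variable and a "do" body does not refer to its loop index. It
is not the preprocessor described in this paper.

```python
def tailor(block, control):
    """Fold branches and unroll loops that depend only on known control values.

    block   : list of nodes, each one of
              ("stmt", text)
              ("if", variable, then_block, else_block)
              ("do", count_variable, body_block)
    control : dict mapping global control variable names to input values
    """
    out = []
    for node in block:
        kind = node[0]
        if kind == "if" and node[1] in control:
            taken = node[2] if control[node[1]] else node[3]
            out.extend(tailor(taken, control))      # other arm is dead code
        elif kind == "do" and node[1] in control:
            body = tailor(node[2], control)
            for _ in range(control[node[1]]):       # complete unrolling
                out.extend(body)
        elif kind == "if":                          # decided at run time: keep
            out.append(("if", node[1],
                        tailor(node[2], control), tailor(node[3], control)))
        elif kind == "do":
            out.append(("do", node[1], tailor(node[2], control)))
        else:
            out.append(node)
    return out


program = [
    ("if", "IPRINT", [("stmt", "CALL DUMP")], []),
    ("do", "NZ", [("stmt", "S = S + A"),
                  ("if", "CONVERGED", [("stmt", "RETURN")], [])]),
]
control_values = {"IPRINT": 0, "NZ": 2}     # hypothetical input values
tailored = tailor(program, control_values)
```

In the tailored result the IPRINT test and the DO on NZ are gone,
the loop body appears twice as straight-line code, and the test on
CONVERGED, which is not a global control variable, is left in place.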


K-C tailoring is applicable to any ordinary
sequential program (in FORTRAN or any other
language) and produces a simplified program in the
same language. The simplified program may then be
optimized for parallel (or sequential) execution
by the standard language processors, vectorizers,
etc.


Effectiveness of the Approach 


To determine the potential effectiveness of
this technique, the data base of FORTRAN programs
was used again. A typical program was taken from
the sample, and the number of control variables
determined. The amount of control sequencing
determined by these control variables was
analyzed. The program consisted of 51 subprograms
that contained 1736 variables that were used for
control purposes. Of these, 133 were identified as
the global control variables of interest. The
program was instrumented to determine the use of
these global control variables. About 20% of the
conditional branching and 40% of the looping was
found to be determined only by the values of these
control variables. A subsequent analysis of a set
of benchmark FORTRAN programs showed a large
variation among programs both in the number of
global control variables and in the amount of
control sequencing affected by these variables.
We have observed counts of 30% of the conditional
branching and 50% of the looping controlled by
global control variables in some programs, but
ranging down to some programs with very few global
control variables controlling less than 1% of the
total number of run-time control decisions. A
more refined set of measurements is currently
being implemented. These preliminary results
indicate that there does exist a substantial class
of production programs for which K-C tailoring
affects a nontrivial portion of the run-time
control decisions.


Inner loops of critical routines are known to 
consume a large proportion of the computing time 
used by many large production programs. We are 
particularly interested, therefore, in the effect 
of K-C tafloring on the performance of inner 
loops. In the same reactor simulation program, a 
Single routine was measured to use about 40% of 
the computing time in a typical run. The critical 
part of this routine is a triply nested DO loop. 
The loop parameters for each loop are global 
control variables (representing the size of the 
reactor core in three dimensions). Thus these 
critical loops are directly affected by K-C 
tailoring. To observe the possible effects from 
various forms of K-C tailoring that involve 
substitution of known control variable values 
followed by loop unrolling, we manually performed 
these transformations on this routine and measured 
the speedup. When the inner loop alone was 
unrolled, a speedup of 1.3 to 1 was measured. 
When the two outer loops were unrolled (leaving 
multiple copies of the small inner loop), a 
speedup of 2.1 to 1 was observed. Both these
transformations may easily be made automatically
during K-C tailoring. Finally a more complex
transformation was performed that changed the
nested I-J-K loops into a K-I-J nest, followed by
unrolling of the two inner loops (leaving one
large outer loop). This transformation led to a
measured speedup of the loop alone of 3.2 to 1,
but since some array transposition was required,
the overall speedup was 2.4 to 1. Complete
unrolling of the entire loop nest was considered
impractical due to the code expansion involved.
Dongarra and Hinds [5] show similar results from
studies of the performance improvements due to
loop unrolling in FORTRAN programs.
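
As a rough illustration of the kind of transformation discussed above, the following Python sketch substitutes a known loop bound and unrolls the inner loop. The measurements reported here were made on a FORTRAN program; the function names, the array layout and the bound of 4 are assumptions made only for this example. The sketch shows that the two forms compute the same result and says nothing about the measured speedups.

```python
# Hypothetical example: N_INNER stands for a global control variable whose
# value is known before execution, so it can be substituted and the inner
# loop unrolled.
N_INNER = 4  # assumed known value of the control variable


def rolled(a, b, n_outer):
    """Original form: the inner loop bound is read at run time."""
    total = 0.0
    for i in range(n_outer):
        for k in range(N_INNER):
            total += a[i][k] * b[k]
    return total


def tailored(a, b, n_outer):
    """Tailored form: bound substituted and inner loop fully unrolled."""
    total = 0.0
    for i in range(n_outer):
        row = a[i]
        total += row[0] * b[0]
        total += row[1] * b[1]
        total += row[2] * b[2]
        total += row[3] * b[3]
    return total
```

The unrolled body has no loop test or index update left in the inner loop, which is the control overhead that tailoring removes.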


Conclusion 


Kernel-control tailoring of a sequential 
program can remove a substantial amount of control 
dependence by using a simple preprocessing of 
static code. After the K-C tailoring is complete, 
the program may be more effectively transformed 
and optimized by existing methods, such as those 
of Kuck, et al. [3] for parallel and vector
machines and those of ordinary global optimizers
for sequential computers. Thus K-C tailoring
appears promising as an adjunct to existing
methods of optimization.


The theory of K-C decomposition suggests that
there is a range of options available in removal
of control dependence through these techniques,
ranging from the straightforward static
preprocessing proposed here, through levels of
increasingly sophisticated (and costly)
separations of control and kernel parts based on
dynamic execution of a partial control part, to
complete decomposition. A deeper understanding of
this range of options may lead to additional
practical methods for K-C tailoring.

Abstract

This paper presents two models of an
instruction execution pipeline incorporating an
instruction prefetching strategy. One model
ignores operand accessing and thus models a
computer system with separate program and data
memories. The second model accounts for operand
accessing. The throughput and memory traffic of
the prefetch strategy are analyzed based upon run
statistics derived from trace tapes of programs
compiled for the IBM 360/370 architecture.


1. Introduction 


Pipelined processors frequently use
instruction prefetching to supply instructions to
the pipeline without delay. Little work has been
published which analyzes instruction prefetching
at the instruction word level in pipelined
computers. Yet an understanding of instruction
prefetching is crucial to the design of high
performance computers when it is not economical to
use a large, fast memory, or when available memory
technology is not fast enough to support the rate
of instruction execution desired.

A prefetching strategy can be stated as 
follows. Instruction words ahead of the one 
currently being decoded are fetched from the 
memory before the instruction decoding unit 
requests them. Thus, the memory access time of an 
instruction word is masked by the execution of 
previously fetched instructions. If the
instruction stream being executed were purely
sequential, then by fetching instructions early
enough, no instruction past the first one executed
would see any delay, and the memory would
experience no more requests for instruction words
than if no prefetching were being performed.

Successful branches disturb the sequentiality
of the instruction stream. A program consists of
unconditional branch instructions, conditional
branch instructions, and non-branch instructions.
Conditional branch instructions result in a
transfer of control only if the condition they are
testing is true, so conditional branch
instructions may be successful or unsuccessful.
Unconditional branch, or jump, instructions always
result in a transfer of control. Therefore, they
are always successful.

One of the first analyses of instruction
prefetching was done by Rau [1,2]. His analysis
does not account for the effect of operand
accessing upon instruction prefetching.


2. Analytical Model


The prefetch strategy of the model requires the
hardware diagrammed in Figure 1. It is assumed
that the memory is interleaved and can accept
requests at a maximum rate of one request per
cycle. A cycle is simply the smallest unit of time
which the model is aware of and corresponds to a
hardware clock cycle. All requests require T
cycles to return from memory. There are two
prefetch buffers, of sizes s and t instruction
words. The s-size buffer holds instructions
fetched during the sequential part of a run. When
a branch is successful, the entire buffer is
invalidated. The other buffer holds instructions
fetched from the target of a conditional branch.
Similarly, when a conditional branch is resolved
and determined to be unsuccessful, the contents of
this buffer are invalidated. The non-pipelined
instruction decoding unit requests instruction
words at a maximum rate of one word per r cycles.
If the instruction requested by the decoder is
available in the sequential buffer for sequential
instructions, or is in the target buffer if a
conditional branch has just been resolved and is
successful, it enters the decoder with zero delay.
Otherwise, the decoder is idle until the
instruction returns from memory. After r cycles,
the instruction type is known.

The program is assumed to begin with a jump
instruction. For the first 1+s cycles, one memory
request for an instruction is issued each cycle,
to fill up the sequential prefetch buffer.
Thereafter, one request is issued every r cycles
until a conditional branch or jump instruction is
decoded.

Except for jump instructions, all decoded 
instructions enter the execution pipeline, where E 
units are required to complete execution. If the 
decoded instruction is an unconditional branch the 
instruction word at the target of the jump is 
requested immediately by the decoder, and decoding 
ceases until the target instruction returns from 
the memory. The pipeline will see the full memory 
latency time, T, since there was no opportunity 
for target prefetching. 

If the decoded instruction is a conditional 
branch, sequential prefetching is suspended during 
the E cycles it is being executed. The 
instruction simultaneously enters the execution 
pipeline, but no more instructions are decoded 
until the branch is resolved at the end of E units. 
Instructions are prefetched from the target memory 
address of the conditional branch instruction. 
Requests for t target instructions are issued at 
the rate of one per cycle. Once the branch is 
resolved, target prefetching becomes unnecessary. 

If the branch is successful, the target
instruction stream becomes the sequential stream,
and instructions are requested every r units from
this sequential stream. Execution of this new
stream begins when the target of the branch
returns from memory, or whenever E units have
elapsed, whichever is later.

If the branch is unsuccessful, instruction 
requests are initiated every r units of time 
following the branch resolution and continue until 
the next branch or jump is decoded. 

A list of the parameters necessary to describe 
the model appears in Table 1. All intervals on the 
following time diagrams are assumed to be closed 
on the left and open on the right. 

A run consists of all instructions executed 
following the decoding of a successful branch 
instruction, and terminating with the decoding of 
another successful branch instruction. There are 
only two run types: those that begin with an 
unconditional branch, and those that begin with a 
successful conditional branch. Figure 2 
illustrates the two run types and details the 
instructions and the execution times that are 
counted as part of the runs. A run whose first 
instruction is the target instruction of an 
unconditional branch will be referred to as a 
u-run, while a run whose first instruction is the 
target of a successful conditional branch will be 
referred to as a cs-run. 

Figure 3a is a time diagram of the execution of
a u-run. An unsuccessful conditional branch
occurs during the run. Instants when memory
requests are submitted for sequential instructions
are marked with an asterisk, while instants when
target requests are submitted to the memory are
marked with a '+'. Figure 3b diagrams the
execution of a cs-run.

The performance measures used to evaluate the
prefetch strategy are the throughput of the
execution pipeline and the memory traffic the
strategy generates. The throughput is defined to
be the number of instructions executed divided by
the execution time of the program, and the memory
traffic is defined to be the number of memory
requests generated divided by the execution time
of the program. The throughput is bounded by 1/r;
memory traffic is bounded by one.
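
The two measures can be written down directly. The short Python sketch below is only an illustration; the function names and the sample counts are hypothetical and are not taken from the traces. The bounds follow from the model: the decoder accepts at most one instruction word every r cycles, and the memory accepts at most one request per cycle.

```python
def throughput(instructions_executed, execution_cycles):
    """Instructions executed per cycle; bounded above by 1/r."""
    return instructions_executed / execution_cycles


def memory_traffic(memory_requests, execution_cycles):
    """Memory requests per cycle; bounded above by one."""
    return memory_requests / execution_cycles


# Hypothetical counts for a program run with decode time r = 1.
r = 1
tp = throughput(1000, 1600)
mt = memory_traffic(1200, 1600)
within_bounds = tp <= 1 / r and mt <= 1
```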




2.1 Analysis of the non-operand fetching case 


At the beginning of every run the condition of 
the prefetch buffers is known. Therefore, the 
prefetching and execution behavior of a program 
can be reconstructed by breaking the instruction 
stream up into runs, and analyzing the performance 
of the prefetch scheme for each run case. For 
arbitrary values of the parameters the analysis 
becomes complex and amounts to simulation of the 
run. By restricting the parameter space,
analytical results are obtained. We do not
present the lengthy derivations of the results; 
for detailed derivations the interested reader is 
referred to [3]. 

Let the interbranch distance be defined as the
number of instructions from one branch instruction
to the next, excluding the first branch, but
including the second. Let the ith such interval
in a run be denoted by $l_i$. Thus, referring to
Figure 4, $l_1 = 2$ is the number of instructions from
the beginning of the run to the first branch,
which is an unsuccessful conditional branch.
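
The definitions of runs and interbranch distances can be sketched in Python as follows. The trace encoding is an assumption made for this example only: each executed instruction is one of 'N' (non-branch), 'J' (unconditional branch), 'CS' (successful conditional branch) or 'CU' (unsuccessful conditional branch). The function names are likewise hypothetical.

```python
def split_runs(trace):
    """Split a trace into (run_type, instructions) pairs.

    A run starts after a successful branch and ends with the next
    successful branch, which is counted as part of the run.
    """
    runs = []
    run_type = None   # unknown until the first successful branch is seen
    current = []
    for op in trace:
        if run_type is not None:
            current.append(op)
        if op in ("J", "CS"):              # a successful branch ends the run
            if run_type is not None:
                runs.append((run_type, current))
            run_type = "u-run" if op == "J" else "cs-run"
            current = []
    return runs                            # a trailing incomplete run is dropped


def interbranch_distances(instructions):
    """Distances l_1, l_2, ... : instructions up to and including each branch."""
    distances, count = [], 0
    for op in instructions:
        count += 1
        if op != "N":
            distances.append(count)
            count = 0
    return distances
```

For the trace J, N, CU, N, CS, N, N, J this gives a u-run with distances 2 and 2, followed by a cs-run with the single distance 3.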

An unsuccessful conditional branch removes
memory cycles from sequential prefetching. 
Therefore, after the branch is resolved, the 
number of instructions in the sequential buffer 
may not be sufficient to support the maximum rate 
of execution, and the decoder may have to wait for 
instructions to return from the memory. Thus an 
unsuccessful conditional branch, after its 
resolution, adds a delay to the execution time of 
a run. 

Let $z_i$ be the number of instructions in the
sequential buffer after the resolution of the ith
unsuccessful conditional branch in a run, and let
$d_i$ be the delay added to the execution time of a
run due to the ith unsuccessful conditional
branch, that is, due to an insufficient number of
instructions in the sequential prefetch buffer.
For convenience we define $z_0$ to be the number of
instructions in the sequential prefetch buffer at
the start of the run and $d_0$ to be the delay
incurred at the start of the run. Then, for a run
with k intervening unsuccessful conditional
branches, we have the following:

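$$\text{Number of instructions executed in any run} = \sum_{i=1}^{k+1} l_i$$

Non-operand fetching results; $0 \le s < T \le E+1$

Unconditional Run:

$$d_i = \begin{cases} 0, & l_{i+1} \le z_i & (1) \\ \max(0,\; T - r z_i), & l_{i+1} > z_i & (2) \end{cases} \qquad 1 \le i \le k$$

$$\text{execution time of run} = kE + \sum_{i=1}^{k+1} r\, l_i + \sum_{i=1}^{k} d_i \tag{3}$$

$$\text{number of memory requests} = kt + \sum_{i=1}^{k+1} l_i + \sum_{i=1}^{k} \left\lceil \frac{d_i}{r} \right\rceil + s + \left\lceil \frac{T}{r} \right\rceil \tag{4}$$

Conditional successful run:

$$z_0 = t \tag{5}$$

$$d_0 = \max(0,\; T - E) + \max(0,\; E + T - tr - \max(T, E)) \tag{6}$$

$$z_i = \min\left(s,\; z_{i-1} + \left\lceil \frac{d_{i-1}}{r} \right\rceil\right) \tag{7}$$

$d_i$ = same as eq. (2)

$$\text{execution time of run} = E + \text{eq. (3)} \tag{8}$$

$$\text{number of memory requests} = (k+1)t + \sum_{i=1}^{k+1} l_i + \sum_{i=0}^{k} \left\lceil \frac{d_i}{r} \right\rceil \tag{9}$$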


2.2 Effect of operand accessing 


Operand accesses interfere with instruction
prefetching when program and data memory are not
separate. To prevent execution pipeline delays,
operand fetches must preempt any instruction
fetches which would otherwise occur. Since memory
cycles are stolen from the prefetch mechanism,
instructions might not be prefetched in time to
avoid decoding delays. The terms requiring an
operand and operand accessing refer to both
operand fetching and storing.

We assume that an instruction which requires an
operand may require only one operand and may not
be a branch instruction. After r units the
decoder will realize that the instruction requires
an operand. At this point, any instruction
prefetching which would have occurred is preempted
and a memory request is issued for the operand.
Simultaneously, the instruction enters the
execution pipeline. For the instruction to make
use of the operand while it is in the pipeline, the
operand must return prior to E units after it
entered the pipeline. We therefore assume
$$0 \le s < T \le E-1$$
for the remainder of Section 2.2.

We can model operand accessing by introducing
p, the probability that an instruction requires an
operand, given that it is not a branch or jump
instruction. This probability can be estimated
from program trace tapes. Furthermore, we will
set r=1, since this presents the most reasonable
case. Equations for u-runs and cs-runs are
presented.

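Unconditional Run:

$$z_0 = 0; \quad d_i = \text{same as eqs. (1) and (2)} \tag{10}$$

$$z_i = \min\bigl(s,\; z_{i-1} + d_{i-1} + (1-p)(l_i - 1) + 1 - l_i + p \cdot \max(0,\; l_i - 1 - T)\bigr); \quad 1 \le i \le k \tag{11}$$

$$\text{execution time of run} = kE + \sum_{i=1}^{k+1} \bigl[\, l_i + d_{i-1} + p \cdot \max(0,\; l_i - 1 - T) \,\bigr] \tag{12}$$

$$\text{number of instruction requests} = kt + \sum_{i=1}^{k+1} \bigl[\, d_{i-1} + 1 + (1-p)(l_i - 1) + p \cdot \max(0,\; l_i - 1 - T) \,\bigr] \tag{13}$$

$$\text{number of operand requests} = p \sum_{i=1}^{k+1} (l_i - 1) \tag{14}$$

Conditional successful run:

$$z_0 = t; \quad z_i = \text{same as eq. (11)} \tag{15}$$

$$d_i = \text{same as eqs. (1) and (2)} \tag{16}$$

$$\text{execution time of run} = E + \text{eq. (12)} \tag{17}$$

$$\text{number of instruction requests} = t + \text{eq. (13)} \tag{18}$$

$$\text{number of operand requests} = t + \text{eq. (14)} \tag{19}$$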
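
As a hedged illustration of how the u-run expressions (12) and (14) are evaluated for r = 1, the Python sketch below takes the interval lengths and the delays as given inputs. The function names are hypothetical, and the sample values of the intervals, delays, E, T and p are assumptions chosen only to make the arithmetic easy to follow; they are not derived from the traces.

```python
def execution_time_u_run(l, d, E, T, p):
    """Equation (12) with r = 1.

    l holds l_1 .. l_{k+1}; d holds d_0 .. d_k (same length as l).
    """
    if len(l) != len(d):
        raise ValueError("l and d must have the same length")
    k = len(l) - 1
    return k * E + sum(
        li + di + p * max(0, li - 1 - T) for li, di in zip(l, d)
    )


def operand_requests(l, p):
    """Equation (14): expected number of operand requests in the run."""
    return p * sum(li - 1 for li in l)


# Assumed sample run with one unsuccessful conditional branch (k = 1).
time_cycles = execution_time_u_run(l=[2, 6], d=[3, 0], E=4, T=3, p=0.25)
operands = operand_requests(l=[2, 6], p=0.25)
```

With these assumed inputs the execution time is 15.5 cycles and the expected number of operand requests is 1.5.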


3. Results 


Since our analysis is not exact and also since
it is not valid for certain ranges of parameters,
a trace-driven simulator was used to verify the
analytical results. Program traces were broken up
into the two run types and runs were further
classified by the number and spacing of
intervening unsuccessful conditional branches.
The frequency of occurrence of each run was used
to weight the execution time and memory request
equations presented in Sections 2.1 and 2.2.


Two program traces were analyzed. These
programs were compiled for the IBM 360/370
architecture. One program, GAUSS, is a FORTRAN
execution of a Gaussian elimination program. The
other program, SLIST, is a trace of a PL/I list
processing program. Table 2 lists some of the
important statistics of each program.

For the non-operand fetching case for T <= E +
1, simulation results differed from analytical
results by at most 2% for GAUSS and at most 4% for
SLIST. The analytical model is only approximately
valid for T > E + 1. For this T for GAUSS, the
simulation results were still within 2% of the
analytical model; however, for SLIST, the error
was less than 9%. The error results from the
treatment of unsuccessful conditional branches in
our derivation. Note in Table 2 that the number of
unsuccessful conditional branches in SLIST is
about 4 times that of GAUSS. We have included the
performance of the non-prefetching case. For this
case we assume that an instruction fetch request
is issued to memory after r units for non-branch
and jump instructions and after r+E units for
conditional branches. Figure 5a shows the
throughput as a function of target prefetch buffer
size. Throughput saturates when t=3. For the
parameters shown, this is to be expected. A
maximum delay of T cycles may occur following the
resolution of a successful conditional branch. To
overcome this delay, enough instructions must be
in the buffer or must have been requested such
that the decoder can run without delay for T
cycles. Loosely, then, t should be chosen such
that rt=T to eliminate the delay cycles. This
means that choosing t=3 will give the maximum
throughput for a given s. Increasing the
sequential buffer size effectively scales the
throughput, and, as for t, s>3 does not enhance
the performance. This is similar to target
prefetch buffer size saturation.
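
The saturation argument can be checked numerically against the start-of-run delay of equation (6). The Python sketch below is only an illustration: the function name is hypothetical, and the values T = 3, E = 4 and r = 1 are assumptions chosen so that rt = T at t = 3; they are not claimed to be the parameters of Figure 5a.

```python
def startup_delay_cs(T, E, t, r):
    """d_0 of equation (6): delay at the start of a cs-run."""
    return max(0, T - E) + max(0, E + T - t * r - max(T, E))


# Assumed parameters: memory access time 3, pipeline length 4, decode time 1.
T, E, r = 3, 4, 1
delays = [startup_delay_cs(T, E, t, r) for t in range(5)]
```

Under these assumptions the delay falls by one cycle for each added target buffer word and reaches zero at t = 3, after which a larger t brings no further reduction.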

Figure 5b plots the memory traffic versus t for 
the same trace, SLIST, of Figure 5a. The traffic 
is not nearly as sensitive to s as the throughput 
is. This occurs because increasing s with t fixed 
has two compensatory effects. On the one hand, 
increasing s increases the number of sequential 
requests generated immediately following the 
decoding of an unconditional branch. On the other 
hand, increasing s reduces the d-delays
immediately following the resolution of 
unsuccessful conditional branch instructions. 
Since sequential requests are generated during 
these delay cycles, increasing s decreases the 
number of these requests which are generated. 

Increasing t with s fixed also has two
compensatory effects. As t increases, the number
of requests for instructions at the target address
of all conditional branch instructions increases.
As t increases, the execution time of cs-runs
decreases, because increasing t reduces the delay
which follows the resolution of a successful
conditional branch instruction. For both SLIST
and GAUSS, the number of conditional branches is
sufficiently greater than the number of successful
conditional branches to cause the memory traffic
to increase as t increases with s fixed.

Results are not presented for GAUSS, since the
effect of s>1 on throughput and traffic is
negligible. This is due to the predominance of
successful conditional branches (11.9%) combined 
with the relative lack of unsuccessful conditional 
branches (2.1%). Since the frequency of 
successful conditional branches in SLIST and GAUSS 
is about the same (10.2% and 11.9%, respectively), 
the effect of t is similar in both cases. 

Finally, we present the effect of operand
accessing on performance. Figure 6a plots
throughput versus the degree of target prefetch
for s=1 and s=3, for both the model and simulation
for SLIST. The p used for the model was 0.159,
which is derived from the trace of SLIST. Thus, a
meaningful comparison can be made between the
simulation and the model. The operand accessing
model is slightly optimistic. Figure 6b plots the
traffic for the simulator and the model. For this
combination of parameters, the operand fetching
model yields results within 8% of the simulation.


4. Concluding Remarks 


Unsuccessful conditional branches expose the
delays that result from an insufficient sequential
buffer size. They also enable operand accesses
to cause delays which cannot be overcome by
increasing the size of the sequential prefetch
buffer. Programs which have few unsuccessful
conditional branches will not suffer from these
problems.


While prefetching from the target of 
conditional branch instructions reduces delays for 
successful conditional branches, it substantially 
increases the memory traffic. Increasing the size 
of the sequential buffer may increase or decrease 
the traffic, depending upon the parameters and 
program under consideration. 

Since conditional branches are much more 
frequent than unconditional branches, significant 
improvements in performance require a predictive 
prefetch strategy. In particular, based upon the 
detailed run analyses performed to evaluate the 
equations of the model, some of Smith's[4] 
strategies may work well. 

Table 1. Model Parameters

Parameter   Description                          Range
r           Instruction decode time              >=1 cycles
s           Size of sequential prefetch buffer   >=0 words
t           Size of target prefetch buffer       >=0 words
E           Execution pipeline length            >=1 segments
T           Memory access time                   >=1 cycles
p           Probability an instruction
            requires an operand

69935 2007 2.9 7127 10.2 5773
Table 2. Program Statistics.


Abstract


LUCAS (Lund University Content Addressable System)
is an associative array computer in the SIMD
(Single Instruction - Multiple Data) category.
This paper describes the programming of the
machine on different levels and introduces the
software tools that are used. A high level language
that takes full advantage of the architecture, and
yet allows powerful manipulation of data on an
algorithmic level, is presented. Programming
examples show the use of the language in signal
processing applications and data base management.


1. Introduction 


An associative array processor is an associative
memory where each memory word contains its own
processing element and where a communication
network is defined between the words. The
organization is of type SIMD (Single Instruction -
Multiple Data), meaning that the same instruction is
executed in parallel on different data. An
important feature of associative array processors is
the possibility to search the memory contents in
parallel. The result of the search (contents in
words satisfies/does not satisfy the search
criterion) is marked in an activation register,
called the tag register, in each word. Subsequent
parallel operations may be limited to the words
where the tag register contains the value 'true'.
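
The search-then-operate pattern can be sketched in a few lines of Python. This is a sequential simulation with hypothetical function names, not LUCAS code: the list comprehensions stand in for what the array does in all words at once.

```python
def search(memory, predicate):
    """Parallel search: return the tag register, one boolean per word."""
    return [predicate(word) for word in memory]


def masked_update(memory, tags, operation):
    """Apply the operation only to words whose tag is true."""
    return [operation(w) if tag else w for w, tag in zip(memory, tags)]


memory = [3, 8, 5, 12]
tags = search(memory, lambda w: w > 4)                 # mark words greater than 4
updated = masked_update(memory, tags, lambda w: w + 1)  # increment marked words
```

Here the word holding 3 is left untouched because its tag is false, while the other three words are incremented.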


The interest in associative array computers has
been steadily increasing during the last two
decades. It has been possible to add the complexity
needed without paying the penalty of much higher
costs. The regular structure of this kind of
machine is very well suited for implementation
in VLSI, and as VLSI design techniques become more
widespread, the interest will continue to grow.


Since the introduction of the first commercially
available associative computer, STARAN, in 1973,
several general purpose high level languages for
programming parallel array computers have been
proposed. Some of these can be said to be true
general purpose languages in that no special
application and no special machine has guided the
design [1-4], while others are oriented towards
an existing computer such as STARAN [5], DAP [6],
and ILLIAC IV [7-9]. Important application areas
for associative array computers are image
processing and data base management. Special purpose
languages for these applications have also been
designed [10,11].


*This research was in part supported by the
National Swedish Board for Technical Development
under grants 79-3770 and 81-606 at the University
of Lund.


0190-3918/82/0000/0253$00.75 © 1982 IEEE


LUCAS (Lund University Content Addressable System)
is an associative array processor currently under
development at the University of Lund, Sweden.
The working mode is bit serial and the machine is
constructed from off-the-shelf components. This
paper describes different programming levels on
the machine and includes a description of a high
level language defined as an extension to Pascal.


2. The Architecture of LUCAS 


LUCAS consists of two parts: a Control Unit and
an Associative Array. By a simple interface it is
attached to a host computer, at present a PROLOG
Z80 microcomputer system with CP/M operating
system. The host computer sends instructions to the
Control Unit and handles input and output of data
(fig 1). Alternatively, input and output can be
handled by a dedicated I/O processor that
communicates directly with a fast secondary memory. A
detailed description of the design is given in
[12].


2.1 The Associative Array 


The Associative Array (fig 2) consists of 256
words that are interconnected by a powerful
interconnection network. The design of one word is
depicted in fig 3.


The word length is 4096 bits. Data is accessed one
bit at a time, pointed to by a 12-bit address from
the Control Unit. A processing element for
bit-serial computations is also included in each word.


The processing element comprises an ALU and four
one-bit result registers: T (tag), R (result), C
(carry) and X (auxiliary). The ALU performs
operations on the five one-bit arguments M, T, R, C
and X. The M input can receive data from either
the corresponding memory word or from another word
via the interconnection network. In this way data
can be moved up and down the array or interchanged
according to different communication patterns,
e.g. the perfect shuffle.
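
To make the bit-serial working mode concrete, the following Python sketch adds two fields in every word, one bit position at a time, keeping a carry per word in the manner of the C register. It is a sequential simulation under stated assumptions: the function name is hypothetical, fields are modelled as Python integers, and the sketch is not the LUCAS microcode.

```python
def bit_serial_add(a_words, b_words, width):
    """Add two `width`-bit fields in every word, one bit position at a time."""
    n = len(a_words)
    carry = [0] * n                       # one carry (C) register per word
    result = [0] * n
    for bit in range(width):              # sequential over bit positions
        for w in range(n):                # conceptually simultaneous in all words
            a = (a_words[w] >> bit) & 1
            b = (b_words[w] >> bit) & 1
            s = a ^ b ^ carry[w]                          # sum bit
            carry[w] = (a & b) | (carry[w] & (a ^ b))     # carry out
            result[w] |= s << bit
    return result                         # overflow beyond `width` bits is dropped


sums = bit_serial_add([3, 5, 255], [4, 6, 1], width=8)
```

The running time of the outer loop depends on the field width and not on the number of words, which is the property that makes the bit-serial organization attractive for a large array.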


The tag register is used as an activation control
for the word since it enables the write signal to
the memory. This mechanism is of the highest
importance and is in fact the key to associative
array computing. A parallel search selects
certain words by setting their tag registers
according to the results of the search. Then follows
computation in parallel on selected words.


Input and output of data is done via a shift
register which is directly accessed from both LUCAS
and the host (or the I/O processor).


2.2 The Control Unit 


The Control Unit (fig 4) accepts instructions from
the host computer and executes these in the form
of microprograms. An instruction to LUCAS can
include up to four parameters, usually describing
the locations and the lengths of the operands.


An important part of the Control Unit is the
Address Processor. It performs fast computations of
addresses to the Associative Array (increment,
decrement, add constant etc.). The Address
Processor contains several index registers and a data
stack. Furthermore the Control Unit contains a
Common Register which is used for parallel
searching in the Associative Array and for operations
between one dimensional and parallel data (e.g.
add the same value to data in every word of the
array).


An example shows the interaction between LUCAS 
and the host: 


Given an n x m dimensional matrix M and an
m dimensional vector V, calculate the n dimensional
vector X = M x V.

The elements of V are stored in the Common
Register and the elements of M in the Associative
Array. Space is reserved for X in the Associative
Array.


Common    V[1]    ...  V[m]


Word 1    M[1,1]  ...  M[1,m]    X[1]       S[1]
Word 2    M[2,1]  ...  M[2,m]    X[2]       S[2]
  ...
Word n    M[n,1]  ...  M[n,m]    X[n]       S[n]

          <---------------->     <-->       <-->
          area in the            area       scratch
          Associative Array      reserved   pad area
          reserved for the       for X
          matrix M

     Actions performed on the host      Actions performed on LUCAS

 1   Specify operation CLEAR FIELD
     to LUCAS with parameter
     indicating the X vector
 2                                      Execute CLEAR FIELD operation
 3   Specify operation to select
     the n first words
 4                                      Select the n first words by
                                        putting a one in the tags
 5   Set j = 1
 6   Calculate the address to the
     j-th elements in V and M
 7   Specify the operation MULTIPLY
     COMMON with parameters
     indicating the j-th elements
     in V and M and the scratch
     pad area
 8                                      Execute the MULTIPLY COMMON
                                        operation, leaving the result
                                        in the scratch pad area
 9   Specify operation ADD FIELDS
     with parameters indicating the
     X vector and the scratch pad
     area
10                                      Execute ADD FIELDS operation
11   j := j + 1
12   If j <= m then go to 6
13   Exit


As can be seen in the example, the host keeps
track of the variables in the Associative Array,
while all the actual calculations are done in
parallel in LUCAS. This is the normal interaction
between LUCAS and the host.


The interface is designed so as to let the two 
actions overlap. 
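
The division of work in the matrix-vector example can be mimicked by a short Python simulation. The function name and data layout are assumptions for illustration only: the outer loop plays the role of the host, and each list comprehension stands for one operation that LUCAS would carry out in all selected words at once. The step numbers in the comments refer to the numbered actions of the example.

```python
def matvec_lucas_style(M, V):
    """Compute X = M x V column by column, as in the host/LUCAS example."""
    n, m = len(M), len(V)
    X = [0] * n                   # steps 1-2: CLEAR FIELD on the X field
    tags = [True] * n             # steps 3-4: select the n first words
    for j in range(m):            # steps 5, 11, 12: loop control on the host
        common = V[j]             # step 6: address the j-th elements of V and M
        # steps 7-8: MULTIPLY COMMON, result left in the scratch pad area S
        S = [M[w][j] * common if tags[w] else 0 for w in range(n)]
        # steps 9-10: ADD FIELDS, X := X + S in every selected word
        X = [X[w] + S[w] if tags[w] else X[w] for w in range(n)]
    return X


X = matvec_lucas_style([[1, 2], [3, 4], [5, 6]], [10, 1])
```

The host loop runs m times regardless of n, so the number of sequential steps grows with the number of columns only.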


3. The software structure 


Great effort has been put into making LUCAS
flexible. Mainly intended as a research machine,
different aspects of array computing have been and
will be exploited. This has led to a design which
is programmable on several levels.


3.1 Micro Programming 


The LUCAS machine instruction set, which
constitutes the software interface to the host computer,
is defined as a set of microprograms, executed by
the control unit. Prior to the execution of an
application program, the microprogram memory can
be loaded from the host. Machine instructions may
range from single search operations up to compound
operations such as matrix multiplications or
operations on data base relations.


In each clock cycle a total of 80 control signals 
are sent from the Control Unit to various parts of 
LUCAS. The control signals can be divided into 
five groups: host communication (6), microprogram 
flow control (32), Associative Array control (16), 
Address Processor control (26) and one group for 
user defined auxiliary functions (2). 


Fig 3 shows a processing element in the
Associative Array. As can be seen in the picture, the
ALU has six inputs and four outputs. The function
performed in the ALU is controlled by a 5-bit
code. This means that only 32 of the 2^256
possible functions in the ALU can be performed. Since
all computations are done bit-serially, it is
extremely important that the ALU instruction set
is well suited for the kind of calculations done,
to avoid unnecessary overhead.
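
The counting behind this remark can be reproduced with a few lines of Python. The sketch assumes that the six inputs and four outputs are all one-bit signals and that any truth table counts as a distinct function; the variable names are hypothetical.

```python
inputs, outputs = 6, 4                       # one-bit ALU inputs and outputs (assumed)
rows = 2 ** inputs                           # 64 rows in the truth table
possible_functions = 2 ** (outputs * rows)   # every row can give any 4-bit result
selectable = 32                              # functions reachable by the function code
code_bits = (selectable - 1).bit_length()    # bits needed to select one of 32
```

Only a vanishing fraction of the possible functions is reachable, which is why the choice of the 32 implemented functions matters for each application area.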


The implementation of the ALU, in the form of a
user programmable read only memory, allows
optimization of its instruction set for different
applications. In signal processing applications,
the instruction set would be strongly oriented
towards arithmetic calculations, while in data
base management applications the transfer of bits
between the four registers and the memory, as well
as boolean functions such as AND, OR etc. would be
chosen.


3.2 Interaction with sequential programs on the host


To avoid dependency upon the choice of the host
computer, nearly all the system software
developed has been written in Pascal. For the same
reason it was decided that application programs also
should be written in a high level language. A
library of Pascal procedures that interact with
LUCAS is incorporated in the system library.


It is interesting to note that the penalty for
using a high level language, as compared to
assembly language, in terms of speed, is remarkably
small, especially in applications with heavy
computations. This is due to the fact that most of
the sequential operations performed are
housekeeping operations and can overlap in time with
computations in LUCAS.


Assembly programming is kept to a minimum, and in
fact less than 50 bytes of machine code are needed
to set up the software interface with LUCAS. This
means that all the software developed is
completely transportable between different hosts.


4. A high level programming language 


The rest of this paper is a description of a high
level language that is currently being implemented
on LUCAS. We start the presentation with a brief
discussion of what we want to obtain followed by
an informal description of the language elements.
In the subsequent sections each element is
described in detail and the use of the language is
presented in the form of programming examples.
Finally a complete description of the syntax is given.


4.1 Primary Considerations 


There are two different approaches to the design 
of high level languages for parallel computers: 


- the parallelism of the computer has a
  correspondence in the syntax of the language.


- the syntax of the language does not contain any
  primitives for parallel computations, but the
  compiler tries to detect inherent parallelism
  in the sequential program and to generate code
  for the parallel computer.


Our view is that the parallelism should be apparent
in the language. If the parallel structure
of the computer has no correspondence in the language,
the user will be less motivated to design
his algorithms in a way which involves parallel 
computation. It is unlikely that an algorithm for 
a sequential computer could easily be transformed 
to fit into a parallel machine. This not only 
calls for unnecessary complications in the compi- 
ler, but also leads to less efficient use of the 
parallel computer. 


4.2 Guidelines for the design 


The following requirements for the language are
stated:


- The language should only include constructs that
can be efficiently implemented on LUCAS.


- The language should be functionally complete in
the sense that all possible algorithms for LUCAS
can be expressed in it.


- The language should be suitable for and emphasize
the use of structured programming.


- The language should be kept small and be fairly
easy to implement.


The use of an existing sequential programming lan- 
guage as a base for the new language would have 
many advantages: 


- The sequential language has a well defined syn- 
tax. 


- Implementation is simpler since existing compi- 
lers can be modified to accept the new language. 


- The user needs to learn relatively few new con- 
cepts. 


4.3 Description of the language 


We decided that the language should be an exten- 
sion to Pascal and it is referred to as Pascal/L. 
The choice of Pascal meets the two last require- 
ments stated above, provided that the extensions 
are chosen carefully. 


Pascal is a well structured language with strong 
typing of variables. Its structure allows a large 
amount of error detection both at compile time and 
at run time. Compilers for Pascal are relatively
simple to implement. The syntax has been chosen
so that only one symbol lookahead is needed, ena- 
bling the use of a simple parsing technique. To 
facilitate the code generation and to make the 
compilers more portable, an implementation scheme 
with code generation for a stack oriented virtual 
machine is used. The existence of portable compi- 
lers, often written in Pascal, simplifies its im- 
plementation on different machines. 


The following extensions to Pascal are defined: 


- Declaration of variables that will be allocated 
in the Associative Array. In the following these 
will be referred to as parallel variables. 


- An indexing scheme to access parts of parallel
variables.


- Expressions involving parallel variables.


- An extended control structure, allowing the use
of parallel variables as control variables.


- Standard functions for alignment of parallel 
variables. 


- Input and output of parallel variables. 


In the text, the words 'scalar' or 'sequential
variable' stand for variables that are allocated
in the host computer memory, while a 'parallel
variable' is allocated in the Associative Array.


Appendix A contains a syntax summary of the lan- 
guage. 


4.3.1 Data Declaration 


Parallel variables are characterized by their dimension
and their range. The dimension of a parallel
variable refers to the number of subscripts
in the declaration. The range, which can be seen
as a measure of the parallelism, is given by the
size of the first dimension.


The linear organization of the LUCAS Associative 
Array makes it especially suited for operations 
on one- and two-dimensional arrays. In principle, 
arrays of any dimension can be represented in 
LUCAS. However the natural storing scheme for one- 
and two-dimensional arrays, where adjoining array 
elements also are physical neighbours, will be 
lost. In this description of Pascal/L we are the- 
refore only concerned with arrays of one and two 
dimensions, even though the final definition of 
the language probably will include arrays of high- 
er dimensions. 


There are two kinds of parallel variables: 
selector and parallel array. 


A selector is defined as a boolean bit vector in- 
tended to limit the parallelism of the operations 
in the Associative Array. At execution time this 
is accomplished by setting the tag registers in
those memory words where the corresponding selec- 
tor element has the value true. When a selector is 
declared, it can optionally be initialized to any 
value.


<selector type> ::= selector [constant..constant] |
          selector [constant..constant] := <boolean
          aggregate>


<boolean aggregate> ::= <choice> => <boolean value> |
          <choice> => <boolean value> , others =>
          <boolean value>


<choice> ::= constant | constant..constant |
          constant..constant step constant


<boolean value> ::= true | false


Examples:
var a : selector[0..255];
var a : selector[0..255]:=(0,1,5=>true, others=>false);
var a : selector[0..199]:=(0..126 step 2=>true,
                           others=>false);
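
The effect of a boolean aggregate can be illustrated with a small sequential sketch in Python. It is not part of Pascal/L; the helper name make_selector and the representation of a selector as a list of booleans are assumptions made only for illustration.

```python
def make_selector(lo, hi, choices, others=False):
    """Build a selector for indices lo..hi as a list of booleans.

    choices is a list of (indices, value) pairs, where indices is any
    iterable of integers. Elements not named take the value `others`.
    (Hypothetical helper, not a Pascal/L facility.)
    """
    sel = [others] * (hi - lo + 1)
    for indices, value in choices:
        for i in indices:
            if not lo <= i <= hi:
                raise IndexError(f"index {i} outside {lo}..{hi}")
            sel[i - lo] = value
    return sel


# selector[0..255] := (0,1,5 => true, others => false)
a = make_selector(0, 255, [((0, 1, 5), True)])

# selector[0..199] := (0..126 step 2 => true, others => false)
b = make_selector(0, 199, [(range(0, 127, 2), True)])
```

The step form maps directly onto a Python range with an inclusive upper bound, which is why 127 rather than 126 is passed as the stop value.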


In the declaration of a parallel array the first
dimension specifies the maximum range of parallelism
for the variable.


var para : parallel array[0..99,0..2] of integer ; 


declares the variable para to be defined in the 
words 0 to 99 in the Associative Array. Each word 
contains three elements of para.


<parallel array type> ::=
          parallel array [constant..constant]
               of <parallel component type> |
          parallel array [constant..constant , constant..constant]
               of <parallel component type>


<parallel component type> ::= <parallel type>


<parallel type> ::= <parallel type identifier> |
          <parallel standard type> | record <parallel
          field list> end


<parallel type identifier> ::= <identifier>


<parallel standard type> ::= integer | real | boolean | char |
          string[constant]


<parallel field list> ::= <parallel record section>
          { ; <parallel record section> }


<parallel record section> ::=
          <field identifier> { , <field identifier> } :
               <parallel standard type>


Example:


var parrec : parallel array[0..99] of record
                a,b : integer;
                c : real
             end;


4.3.2 Indexing 


When operating on a parallel variable it is possi- 
ble to reference several elements along its paral- 
lel dimension at the same time. This set of ele- 
ments is referred to as the range of parallelism 
for the operation. If the index is omitted, the 
complete array is referenced.


<parallel indexed variable> ::=
          <parallel array variable> [ <first index> ] |
          <parallel array variable> [ <first index> ,
               <expression> ]


<parallel array variable> ::= <variable>


<first index> ::= <selector expression> | * |
          constant | constant..constant


<selector expression> ::= <expression>

Examples:
para[*,0]      Selects column 0 of para. Para is a
               two dimensional parallel variable.

para[a,0]      Where a is a selector, selects a subset
               of column 0 in para.

para[2..4,0]   Selects elements 2, 3 and 4 of column 0
               in para.


4.3.3 Expressions and Assignment 


It is possible to combine sequential and parallel 
variables in expressions as long as no type con- 
flict occurs. An expression that contains parallel 
variables always results in a parallel value (ex- 
cept for some standard functions that take a pa- 
rallel variable as an argument and yield a scalar 
value). The meaning of expressions such as: 


4 * para or para > 4


where para is an array, is that the scalar is com- 
bined with each element of the parallel variable. 


There are four kinds of assignment statements: 


1) Left side and right side are scalars. 
This is the normal Pascal assignment statement. 


2) Left side is parallel and right side is scalar. 
All elements within the range of the left side 
variable are assigned the value of the scalar 
expression. 


3) Left side is scalar and right side is parallel. 
The right hand side of the assignment should 
be a parallel variable indexed so that only one 
element is selected. 


4) Left side and right side are parallel. 
The elements within the range of the left hand
side variable are assigned the corresponding
values of the right hand side expression. The
range of the expression must be equal to, or
overlap, the range of the left hand side variable.


In expressions, all parallel variables must have 
the same range, otherwise a run time error occurs. 


The following program gives an example of diffe- 
rent kinds of assignment. 


Program Assign;
var odd : selector[0..255]:=(1..255 step 2=>
                             true, others=>false);
    even, sel : selector[0..255];
    p1,p2 : parallel array[0..255] of integer;
    i : integer;
begin
  even:=not odd;    (* Both sides parallel. The
                       same range. *)

  p1[even]:=p2*2;   (* Both sides parallel. The
                       range of the right side
                       expression overlaps the
                       range of the left side
                       variable. *)

  p1[odd]:=0;       (* Left side parallel. Right
                       side scalar. *)

  i:=p2[5];         (* Left side scalar. Right
                       side parallel. The range
                       of parallelism includes
                       one element. *)

  sel:=p1 > p2;     (* Both sides parallel. The
                       same range. *)

  i:=p2[sel];       (* Left side is scalar. Right
                       is parallel. sel must have
                       one and only one true element. *)
end.


In statements where data is stored in different
words in the Associative Array, the movement of
data must be explicitly specified using standard
functions for data alignment.


var p1,p2 : parallel array[0..100] of real;
begin

  (* p1[2]:=p2[3]; is not allowed. Should be
     written: *)
  p1[2]:=shift(p2,-1);

  (* p1[4..84]:=p2[0..80]; is not allowed.
     Should be written: *)
  p1[4..84]:=shift(p2,4);


4.3.4 The Control Structure 


To control the sequential program flow, Pascal
contains five different structured constructs:
if, case, while, repeat and for statements. The
first two are used to select different paths in
the program execution, while the remaining three
control repetition of statements. Similar concepts
are included in Pascal/L to allow parallel expressions
to control selection and repetition.


The construct: 


if boolean expression then true-statement 
else false-statement 


in Pascal selects one of two different paths in 
the program flow, depending on the value of the 
boolean expression. In the corresponding parallel 
statement, the boolean expression yields a selec- 
tor. Each element in the selector determines what 
statements will be executed on its corresponding 
data elements. 


In a global perspective this means that both the 
true statement and the false statement are execu- 
ted, but on different data. Rather than to extend 
the if-then-else construct in Pascal, a parallel 
selection takes the form: 


where parallel boolean expression do true- 
statement 
elsewhere false-statement 


where the elsewhere-part is optional. 
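
A sequential sketch in Python may help to show the semantics: both branches run, each on the words selected for it. The function name where, the callback style and the decision to evaluate the mask once before either branch runs are assumptions of this sketch, not statements about the LUCAS implementation.

```python
def where(mask, true_stmt, false_stmt=None):
    """Run true_stmt(i) for every index i whose mask element is true,
    then false_stmt(i) for every index whose mask element is false."""
    frozen = list(mask)  # assumption: the mask is fixed before the branches run
    for i, selected in enumerate(frozen):
        if selected:
            true_stmt(i)
    if false_stmt is not None:
        for i, selected in enumerate(frozen):
            if not selected:
                false_stmt(i)


p = [3, -1, 4, -5]
out = [0] * len(p)


def keep(i):
    out[i] = p[i]


def negate(i):
    out[i] = -p[i]


# where p > 0 do out := p elsewhere out := -p
where([x > 0 for x in p], keep, negate)
```

After the call, out holds the absolute values of p: the true-statement and the false-statement have both been executed, but on disjoint sets of elements.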


Analogous to the Pascal case statement, which is
a generalization of the if-then-else construct
where the selection is based on the value of the
expression given at the head of the case statement,
Pascal/L defines a parallel form of the case
statement. As with the if-then-else construct, the
parallel case does not choose one execution path
but all, each working on different data. The form
of the parallel case statement is:


case where parallel expression of

    constant1 : statement;
    constant2 : statement;
    constantn : statement;
    others : statement

end


where the others-part is optional. In the implementation
of the parallel case statement, care
must be taken so that the code generated ensures
that only one choice is made for each word in the
Associative Array. Since every choice in the list
is taken, one after another, it is possible that
a variable in the head expression is changed so
that a second correspondence would occur, this time
with another constant.
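
One way to guarantee a single choice per word is to keep a mask of words that have not yet taken a branch. The Python sketch below illustrates that idea only; the name case_where and the bookkeeping are assumptions, and the code actually generated for LUCAS may differ.

```python
def case_where(head, branches, others=None):
    """head: list with one value per word (read live, so it may change).
    branches: list of (constant, stmt) pairs tried in order.
    Each word executes at most one statement."""
    pending = [True] * len(head)  # words that have not yet made a choice
    for constant, stmt in branches:
        chosen = [pending[i] and head[i] == constant for i in range(len(head))]
        for i, hit in enumerate(chosen):
            if hit:
                pending[i] = False
                stmt(i)
    if others is not None:
        for i, waiting in enumerate(pending):
            if waiting:
                others(i)


v = [1, 2, 1, 3]


def set2(i):
    v[i] = 2      # changes the head variable itself


def set20(i):
    v[i] = 20


def zero(i):
    v[i] = 0


case_where(v, [(1, set2), (2, set20)], others=zero)
```

Words 0 and 2 become 2 in the first branch. Without the pending mask they would match the constant 2 in the second branch as well; with it, only word 1 takes the second branch and word 3 falls through to others.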


In a similar way an extension to the Pascal


while boolean expression do repetition statement


is defined to control repetition for parallel data:


while and where parallel boolean expression
do repetition statement


Here the repetition statement is repeated as long
as the parallel boolean expression takes the value
true in any element. The repetition statement
is only executed on data where the corresponding
element in the boolean expression has the value
true.


The following example illustrates the use of the 
while and where construct: 


var v1,v2 : parallel array[0..2] of integer;


begin
  v1[0]:=2; v1[1]:=4; v1[2]:=3; v2[0]:=0;
  v2[1]:=0; v2[2]:=0;
  while and where v1 > 0 do
  begin
    v2:=2*v2+v1;
    v1:=v1-1
  end;


loop iteration
     1          v2[0]<-2*0+2=2
                v2[1]<-2*0+4=4
                v2[2]<-2*0+3=3


     2          v2[0]<-2*2+1=5
                v2[1]<-2*4+3=11
                v2[2]<-2*3+2=8


     3          v2[0]<-5 unchanged since v1[0]=0
                v2[1]<-2*11+2=24
                v2[2]<-2*8+1=17


     4          v2[0]<-5 unchanged since v1[0]=0
                v2[1]<-2*24+1=49
                v2[2]<-17 unchanged since v1[2]=0


The loop ends after four iterations since all elements
of v1 are then 0.
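
The trace can be reproduced with a short sequential simulation in Python. The lists stand in for the parallel arrays, and the inner loop plays the role of the Associative Array executing the body on all active words at once; this is a sketch of the semantics, not of the hardware.

```python
v1 = [2, 4, 3]
v2 = [0, 0, 0]
iterations = 0

# while and where v1 > 0 do begin v2 := 2*v2 + v1; v1 := v1 - 1 end
while any(x > 0 for x in v1):
    active = [x > 0 for x in v1]      # words that take part in this pass
    for i, on in enumerate(active):
        if on:
            v2[i] = 2 * v2[i] + v1[i]
            v1[i] = v1[i] - 1
    iterations += 1
```

The loop leaves v2 equal to [5, 49, 17] after four passes, in agreement with the table.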


4.3.5 Standard Functions 


A number of standard functions for data alignment 
are defined. Some of these work on variables with 
arbitrary range of parallelism, while others are 
defined for fixed size variables. 


shift (parallel array | selector, i)


The function shifts the parallel variable i steps
along its first dimension. Zero elements are shifted
in from the edge. This corresponds to moving
data up or down the Associative Array.


rotate (parallel array | selector, i) 


Similar to the shift function except that the ele- 
ments that are shifted out at one edge are shif- 
ted in at the opposite edge of the parallel varia- 
ble. 


propagate (selector, i) 


The propagate function copies all true elements in 
the selector to the i following elements. 


var s1 : selector[1..10]:=(3,6=>true, others=>
                           false);
    s2 : selector[1..10];
begin


  (* s2 will be true in
     elements (3,4,5,6,
     7,8). *)


  s2:=propagate(s1,2);
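
The three functions described so far can be modelled on Python lists as follows. The sign convention, result[k] = v[k - i], is inferred from the alignment example in section 4.3.3, where shift(p2,-1) brings p2[3] to position 2; the fill value and the list representation are assumptions of this sketch.

```python
def shift(v, i, fill=0):
    """result[k] = v[k - i]; positions without a source element get fill."""
    n = len(v)
    return [v[k - i] if 0 <= k - i < n else fill for k in range(n)]


def rotate(v, i):
    """As shift, but elements leaving one edge re-enter at the other."""
    n = len(v)
    return [v[(k - i) % n] for k in range(n)]


def propagate(sel, i):
    """Copy every true element of sel to the i following elements."""
    n = len(sel)
    return [any(sel[j] for j in range(max(0, k - i), k + 1)) for k in range(n)]


# selector[1..10] with elements 3 and 6 true; list index 0 is element 1
s1 = [k in (3, 6) for k in range(1, 11)]
s2 = propagate(s1, 2)
true_elements = [k + 1 for k, bit in enumerate(s2) if bit]
```

For the selector of the example, true_elements evaluates to [3, 4, 5, 6, 7, 8].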


exchange (parallel array | selector)


The elements of the variable are pairwise interchanged
using the Exchange Network. The range of
the variable must be even.


shuffle (parallel array | selector)


The variable is shuffled using the Perfect Shuffle 
Interconnection Network. The function is only de- 
fined for parallel variables which have a range 
corresponding to the size of the Associative Ar- 
ray. 
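
The two permutations can be sketched in Python as below. The perfect shuffle is written in its usual form, interleaving the first and second halves of the vector; whether the LUCAS network applies this permutation or its inverse is not stated here, so the direction is an assumption.

```python
def exchange(v):
    """Swap the elements of each pair (0,1), (2,3), ..."""
    if len(v) % 2:
        raise ValueError("range must be even")
    out = list(v)
    for k in range(0, len(v), 2):
        out[k], out[k + 1] = v[k + 1], v[k]
    return out


def shuffle(v):
    """Perfect shuffle: interleave the first half with the second half."""
    if len(v) % 2:
        raise ValueError("range must be even")
    half = len(v) // 2
    out = []
    for a, b in zip(v[:half], v[half:]):
        out.extend((a, b))
    return out


print(shuffle(list(range(8))))   # [0, 4, 1, 5, 2, 6, 3, 7]
print(exchange([0, 1, 2, 3]))    # [1, 0, 3, 2]
```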


first (selector) 


This function finds the first true element in a
selector and returns a new selector with only this 
element true. 


next (selector) 


The next-function returns the same value as the 
first-function. The difference is that the first 
true element in the parameter automatically is 
assigned the value false. 


any 


The any-function returns the value false if a pre- 
vious call to the first- or the next-function re- 
turned an all-false selector, otherwise it returns 
the value true. 


some (parallel boolean expression) 


A call to the some-function evaluates the boolean
expression and returns the value true if it contains
at least one true element, otherwise it returns
the value false.


var par1 : parallel array[0..9] of integer;
    sel1 : selector[0..9];
    su : integer;


begin
  su:=0;
  sel1:=par1[*] > 10;     (* select elements greater
                             than 10 *)
  while some(sel1) do
    su:= su + par1[next(sel1)];


  (* su contains the sum of all elements in par1
     whose value > 10 *)
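
A Python model of first, next and some makes the summation loop concrete. The function next_ is so named only to avoid the Python builtin; the sample data and the list representation are assumptions of this sketch.

```python
def first(sel):
    """Return a selector with only the first true element of sel set."""
    out = [False] * len(sel)
    for i, bit in enumerate(sel):
        if bit:
            out[i] = True
            break
    return out


def next_(sel):
    """As first, but also clears that element in the argument."""
    out = first(sel)
    for i, bit in enumerate(out):
        if bit:
            sel[i] = False
    return out


def some(mask):
    """True if the mask contains at least one true element."""
    return any(mask)


par1 = [12, 3, 40, 10, 11, 7, 0, 25, 9, 10]   # sample data (assumed)
sel1 = [x > 10 for x in par1]
su = 0
while some(sel1):
    pick = next_(sel1)                 # exactly one element is true
    su = su + par1[pick.index(True)]
```

For the sample data the elements 12, 40, 11 and 25 are selected one at a time, and sel1 is all false when the loop ends.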


responders (parallel boolean expression) 


The responders-function evaluates the boolean expression
and returns the number of true elements
in the result.


range (parallel expression) 


Returns a selector of range 256 with true elements 
indicating the range of parallelism for the ex- 
pression. 


4.3.6 Input and Output 


The Pascal standard procedures read and write are 
extended to allow input and output of parallel va- 
riables. Either complete parallel arrays or selec- 
ted subsets can be read and written. 


5. Programming examples 


The use of Pascal/L in two different applications
is presented in the following programming examples.


The first example demonstrates one common opera- 
tion on a relational data base, namely the PROJECT 
operation: 


PROJECT R1 OVER A GIVING R2


where R1 and R2 are relations and A an attribute
of R1. This operation creates a new relation, R2,
from R1 by discarding attributes other than A.
After that, all redundant tuples are removed from
R2. Each relation has a corresponding mark selector
that indicates where tuples are defined. A
description of the operation can be found in [13].


Program Project;
var r1mark : selector[0..255];  (* Selects words
                                   containing r1 *)
    r2mark : selector[0..255];  (* Selects words
                                   containing r2 *)
    temp1 : selector[0..255];   (* Marks remaining
                                   tuples in r1 *)
    temp2 : selector[0..255];   (* Marks all duplicates
                                   of the tuple that is
                                   under comparison *)
    r1 : parallel array[0..255] of record
           a,b,c : string[20]
         end;
    instance : string[20];
begin

  . . .  (* Relation r1 is input and r1mark is
            initiated *)
  temp1:=r1mark;
  instance:=r1[first(r1mark)].a;   (* Select first
                                      instance of
                                      attribute a *)
  while any do
  begin
    temp2:=(instance=r1[temp1].a); (* Select duplicates *)
    temp1[temp2]:=not (temp1);     (* mark as analyzed *)
    r2mark[first(temp2)]:=true;    (* the first is
                                      included in r2 *)
    instance:=r1[first(temp1)].a;  (* get next
                                      distinct instance
                                      of attribute a *)
  end;
end.
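
The duplicate removal at the heart of the program can be followed in a sequential Python sketch. The function name project and the list representation are assumptions; the list comprehension that builds dup stands for the single parallel comparison performed in the Associative Array.

```python
def project(values, mark):
    """values: attribute a of every word; mark: selector of defined tuples.
    Returns a selector marking one representative of each distinct value."""
    remaining = list(mark)                 # temp1: tuples not yet analyzed
    keep = [False] * len(values)           # r2mark
    while any(remaining):
        i = remaining.index(True)          # first(temp1)
        instance = values[i]
        # temp2: all remaining tuples equal to the instance
        dup = [remaining[k] and values[k] == instance
               for k in range(len(values))]
        for k, is_dup in enumerate(dup):
            if is_dup:
                remaining[k] = False       # mark as analyzed
        keep[i] = True                     # the first duplicate enters r2
    return keep


values = ["x", "y", "x", "z", "y", "q"]
mark = [True, True, True, True, True, False]   # last word holds no tuple
r2mark = project(values, mark)
```

The number of passes through the loop equals the number of distinct values, not the number of tuples, which is where the associative comparison pays off.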


The second example shows how the FFT algorithm 
can be programmed on LUCAS. For details of the 
use of the Perfect Shuffle and Exchange Networks 
in FFT, see [12,14]. 


Program FFT;

const iterations=8;  (* 8 iterations for 256 point
                        FFT *)
type complex = record re,im : real end;
var omega : parallel array[128..255,1..iterations]
              of complex;
    product : parallel array[128..255] of complex;
    samples : parallel array[0..255] of complex;
    spectrum : parallel array[0..255] of complex;
    iter : integer;
    lower : selector[0..255]:=(0..127=>false,
                               others=>true);
    even : selector[0..255]:=
             (0..254 step 2=>true, others=>false);
begin

  . . .  (* input samples and complex constants
            omega *)
  spectrum:=samples;  (* spectrum after 0 iterations *)
  for iter:=1 to iterations do
  begin
    product[lower].re:=spectrum[lower].re *
                       omega[lower].re -
                       spectrum[lower].im *
                       omega[lower].im;
    product[lower].im:=spectrum[lower].re *
                       omega[lower].im +
                       spectrum[lower].im *
                       omega[lower].re;

    where even do
    begin
      spectrum.re:=spectrum.re + shuffle(product.re);
      spectrum.im:=spectrum.im + shuffle(product.im);
    end
    elsewhere
    begin
      spectrum.re:=spectrum.re - exchange(shuffle(product.re));
      spectrum.im:=spectrum.im - exchange(shuffle(product.im));
    end
  end;  (* for-loop *)

  (* FFT spectrum is found in array 'spectrum'
     with bit-reversed index *)
end.


6. Summary 


The LUCAS associative array processor is intended 
as a working tool for research in the field of 
associative processing and some related applica- 
tion areas. In this paper programming aspects are 
investigated and a high level language, Pascal/L, 
is proposed. 


Pascal/L is defined as a superset of Pascal and
includes the following extensions:

- parallel variables that are allocated to the
Associative Array

- an indexing scheme to access parts of the parallel
variables

- expressions that include parallel variables

- an extended control structure, where parallel
expressions are used to control the execution

- standard functions for data alignment of parallel
variables

- extended input and output.


Appendix A. 
Syntax summary of the extensions to Pascal


Data Declarations


<selector type> ::= selector [constant..constant] |
          selector [constant..constant] := <boolean
          aggregate>


<boolean aggregate> ::= <choice> => <boolean value> |
          <choice> => <boolean value> , others =>
          <boolean value>


<choice> ::= constant | constant..constant |
          constant..constant step constant


<boolean value> ::= true | false


<parallel array type> ::=
          parallel array [constant..constant]
               of <parallel component type> |
          parallel array [constant..constant , constant..constant]
               of <parallel component type>


<parallel component type> ::= <parallel type>


<parallel type> ::= <parallel type identifier> |
          <parallel standard type> | record <parallel
          field list> end


<parallel type identifier> ::= <identifier>


<parallel standard type> ::= integer | real | boolean | char |
          string[constant]


<parallel field list> ::= <parallel record section>
          { ; <parallel record section> }


<parallel record section> ::=
          <field identifier> { , <field identifier> } :
               <parallel standard type>


Indexing


<parallel indexed variable> ::=
          <parallel array variable> [ <first index> ] |
          <parallel array variable> [ <first index> ,
               <expression> ]


<parallel array variable> ::= <variable>


<first index> ::= <selector expression> | * |
          constant | constant..constant


<selector expression> ::= <expression>


Statements


<where statement> ::=
          where <parallel boolean expression> do <statement> |
          where <parallel boolean expression> do <statement>
               elsewhere <statement>


<while and where statement> ::=
          while and where <parallel boolean expression>
               do <statement>


<parallel case statement> ::=
          case where <parallel expression> of
               <case list element> { ; <case list element> } end |
          case where <parallel expression> of
               <case list element> { ; <case list element> } ;
               others : <statement> end


<case list element> ::= <case label> { , <case label> } : <statement>


Abstract -- A two level decomposition strategy for
wafer scale implementation of CHiP processors is 
presented. With current technology, machines with 
over 300 processors per wafer can be fabricated. 
These wafer scale machines will be cheaper, faster and 
more reliable than their counterparts implemented 
with single chip components. 


Introduction 


Many different architectures for parallel processors
have been proposed but few large-scale parallel
systems have actually been built. One reason is that a
large-scale parallel processor consists of a great many
components. This introduces severe practical problems
of construction, wiring and reliability. If the
number of individual components could be decreased,
parallel processors would be far easier and cheaper to
construct.


The absolute minimum number of components is
reached when the entire parallel processor is fabricated
on a single silicon wafer. These wafer scale systems
have greatly reduced cost due to the increased
level of integration. Reliability is higher since the
connections between processors are implemented in silicon.
Furthermore, there is the potential for increased
performance since data values passed between processors
are not driven off the wafer.


Wafer scale integration (WSI) has been previously
attempted by discretionary wiring [1]. Due to the
additional masking steps required, this has not proved
to be practical. Other researchers are currently
investigating laser restructuring [2] and fuse blowing
approaches to implementing WSI.


At the center of our approach is the configurable,
highly parallel (CHiP) processor [3] family of
restructurable architectures. CHiP computers are composed
of many simple processing elements (PEs) that are not
directly connected together but are inserted at regular
intervals into a switch lattice. The programmable
switches can be set to connect the PEs in a wide
variety of interconnection patterns.


We propose wafer scale implementation of CHiP
processors. No extra masking steps are required, making
the approach cost effective. A two level methodology
decomposes the problem into mapping small CHiP
machines into building blocks and then structuring
the building blocks on the wafer. Although we consider
CHiP computers, the concepts presented are entirely
general and can be applied to other parallel systems.


(a) Research supported in part by Office of Naval Research under
Contract Nos. N00014-80-K-0816 and N00014-81-K-0360.

Author’s present address is Dept. of Computer Science, U. of


Implementing Wafer Scale Integration


A large number of simple PEs can be patterned on
a single wafer. But on any given wafer, many of the PEs
will contain defects - errors in the circuitry such as
broken wires or nonfunctional transistors. These
defects are randomly distributed over the wafer surface.


To implement a wafer scale system, all PEs on a
wafer are tested, and then the good PEs are connected
together. The wafer is structured so that the presence
of faulty PEs is masked and only functional PEs are
used. The switch lattice of CHiP processors provides
the interconnection flexibility required to structure
the wafer. Switches can be programmed to route
around faulty PEs and connect together only the
functional PEs. Redundant switches are added to the
lattice to perform the structuring.


The structuring problem is made difficult by low 
PE yield; for any particular PE it is very unlikely that 
all its four neighbors will also be functional. The posi- 
tioning of good PEs on the wafer differs from the 
required connection pattern. Hence considerable wir- 
ing may be required to connect a PE to its neighbor in 
the CHiP lattice. This introduces delays from the 
intervening switches and increases signal transmission 
time. System speed is proportionately reduced. 


Now suppose that most PEs are functional. The
good PEs are distributed in a more regular pattern -
one more closely resembling a lattice. This simplifies
the structuring problem. Faulty PEs can be eliminated
by column exclusion: all PEs in a column containing a
faulty PE are eliminated, and the columns adjacent to
the excluded column are connected together. The only
requirement is that we can wire around faulty or
unused PEs. This strategy has been used previously in
64K memories and in Batcher’s MPP.
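
Column exclusion is simple enough to state in a few lines of Python. The grid of test results and the function name exclude_columns are assumptions made for illustration; the sketch only decides which columns survive and says nothing about how the neighbouring columns are wired together.

```python
def exclude_columns(good):
    """good[r][c] is True when the unit in row r, column c passed its test.
    Returns the indices of the columns that contain no faulty unit."""
    ncols = len(good[0])
    return [c for c in range(ncols) if all(row[c] for row in good)]


grid = [
    [True, True, True,  True],
    [True, True, False, True],   # one faulty unit in column 2
    [True, True, True,  True],
]
kept = exclude_columns(grid)     # columns 0, 1 and 3 remain
```

A single fault costs a whole column, which is why the strategy is only attractive when faults are rare.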


For this simple approach to be practical, the 
wafer must contain very few faulty PEs. But due to the 
nature of the integrated circuit manufacturing pro- 
cess, high yield is achievable only with very simple 
circuits - much less complex than a PE.


But suppose the units patterned on the wafer are
not individual PEs but building blocks of a CHiP
machine. Each block is itself a small CHiP processor.
By providing sufficient redundancy within each block, a
smaller but completely functional CHiP machine (the
virtual lattice) will exist within almost every building
block. With each block contributing a small, fixed size
virtual lattice, a large CHiP machine is formed from
the blocks.


If enough redundancy is provided within each 
building block, the percentage of blocks containing a 
smaller, completely functional virtual lattice will be 
very high. This allows the use of the column exclusion 
strategy to eliminate the relatively rare block that is 
completely dysfunctional. 


To determine the degree of redundancy required, 
we developed a yield model based on the Price model 
of the integrated circuit manufacturing process. This 
model is the basis for the quantitative determination 
of the effect of redundancy. 


The smaller virtual lattice must be mapped into 
the larger building block. This mapping makes the 
block function as if it were a virtual lattice. An 
observer of the input/output behavior of the block 
would be unable to distinguish it from a virtual lattice. 
The mapping associates each vertex (PE or switch) in 
the virtual lattice with an image in the block. Further- 
more, every datapath in the virtual lattice becomes a 
path in the block. A path may be a single datapath or 
a sequence of connected switches. 


In summary, we have introduced a two level 
decomposition of the wafer scale lattice. A very large 
CHiP lattice is patterned on the wafer. It is logically 
divided into small building blocks. From almost every 
building block a small fixed size CHiP processor is 
extracted, and the blocks are composed to form the 
wafer scale machine. Faulty blocks are eliminated by 
column exclusion. Note that the two level decomposi- 
tion limits the length of paths between PEs to the size 
of a block. This assures that system performance will 
not be catastrophically degraded by the occurrence of 
faulty PEs.


Designing Building Blocks


Each PE has a simple arithmetic-oriented instruction
set, an 8-bit ALU and 64 bytes of memory. This is
sufficiently powerful to execute a wide variety of
systolic algorithms. Implemented with current (2 µm)
technology, each PE occupies approximately a 1.75mm
x 1.75mm region of silicon.


A 2 x 2 virtual lattice is mapped into a building
block. From the yield model, the cumulative probability
density function of defects (Fig. 2) for relative area
A = 1.0 and 2.0 is known. From this we can derive [3]
the probability that at least 4 out of N PEs are
functional (Table 1). Four good PEs can be found in a set
of 12 in 99% of cases. Consequently, each building
block is chosen to be a 4 PE x 3 PE CHiP machine, ensuring
that almost all blocks contain the 4 PEs required
for a 2 x 2 virtual lattice. In addition to redundant PEs,
each building block also has twice the required
number of switches (Fig. 1b). These redundant
switches are used to map the 2 x 2 lattice into the
block.
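
To give a feeling for how redundancy raises the recovery probability, the Python sketch below computes the chance that at least k of n PEs are good under a simple binomial model with independent PE yield p. This is a deliberate simplification and an assumption of the sketch: the figures quoted in the text come from the Price-based yield model, which is not reproduced here, so the numbers produced by this code should not be expected to match Table 1.

```python
from math import comb


def at_least(k, n, p):
    """Probability that at least k of n independent units are good,
    when each unit is good with probability p (binomial assumption)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))


# Recovering 4 good PEs gets easier as redundant PEs are added.
for n in (4, 6, 8, 12):
    print(n, round(at_least(4, n, 0.6), 4))
```

The per-PE yield of 0.6 used in the loop is an arbitrary illustrative value, not a figure from the paper.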


A Wafer Scale CHiP Processor 


A 9 x 9 grid of building blocks can be patterned on
a 4" wafer (Fig. 3). The bonding pads and drivers
required to connect the wafer scale CHiP machine to
external memory (or other wafer scale machines) are
placed around the periphery of the grid. Redundant
drivers are used to guarantee the integrity of the
external connections. The grid occupies only 68% of
the wafer area which leaves sufficient remaining area
for 150 pads and drivers per lattice edge. Packaging
constraints may place a lower limit on the number of
external connections.


A 2 x 2 virtual lattice is recovered from each
block 99% of the time, and the occasional faulty block
is eliminated by excluding the column containing the
fault. Table 2 shows the frequency of different grid
sizes after faulty blocks are eliminated. Almost half of
the wafers use all 81 blocks. An "average" wafer contains
a CHiP processor of 297 PEs. The switches in
unused or faulty blocks are used to connect the blocks
in the columns adjacent to a faulty block. Thus the
"wire around" requirement for blocks becomes a "wire
through" capability via the CHiP switch lattice.


Is this approach efficient? In addition to excluded
columns, only 4 of the 12 PEs in each block are used.
On the average, there are 74 usable 2 x 2 CHiP lattices
in each wafer scale machine (Table 1). Suppose we
simply pattern the entire wafer with 2 x 2 lattices. A
4" wafer holds 288 of these. At the predicted 20% yield,
only 58 of the lattices are fully functional. Hence fault
tolerant building blocks containing redundant components
are area efficient. The area lost to redundancy
is more than made up for by the increased
recoverability of the blocks. Moreover, the wafer scale
solution is more robust to failures, has better
performance and lower cost.
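
The comparison rests on a little arithmetic, which can be restated in Python using only the figures quoted in the paragraph above. The variable names are introduced here for illustration.

```python
plain_lattices = 288        # 2 x 2 lattices patterned without redundancy
plain_yield = 0.20          # predicted yield of such a lattice
plain_good = round(plain_lattices * plain_yield)   # 57.6, rounded to 58

block_good = 74             # usable 2 x 2 lattices with redundant blocks
pes_per_lattice = 4
block_pes = block_good * pes_per_lattice            # 296 PEs

gain = block_good / plain_good                      # roughly 1.28
```

The redundant design therefore yields about a quarter more usable lattices from the same wafer, and 74 lattices of 4 PEs give 296 PEs, close to the average of 297 PEs quoted for the wafer scale processor.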


Conclusions 


The two level decomposition could be a practical
method for implementing wafer scale integration. It is
cost effective since no additional processing steps are
required. Additionally, the maximum wire length
between PEs is limited, and the wafer area is efficiently
utilized.


As described above, our methodology benefits 
from the fact that the mechanism needed for structur- 
ing, the switch lattice, is an integral part of the archi- 
tecture. Although this simplifies our work, it is not 
necessary. The method is entirely general. It can be 
applied to other systems composed of uniform parts 
including parallel processors with fixed interconnection
structures.


Practical problems of testing, power consumption, 
synchronization and clocking, etc. are discussed in 
detail in [4,5]. 
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Table 1 - Recovery of 4 PEs frorn Woh 
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Table 2 - Size of Wafer Scale Processor for 9 x 9 Grid 


Lattice Size from a 9 x 9 Grid 


cumulative      size of CHiP 
probability     processor (PEs) 
                18 x 18 = 324 
                18 x 16 = 288 
                16 x 16 = 256 
                16 x 14 = 224 
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Figure 3 - Layout of a Wafer Scale CHiP Processor 
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TESTING COORDINATION FOR "HOMOGENEOUS" PARALLEL ALGORITHMS 


Janice E. Cuny 
Lawrence Snyder 
Department of Computer Sciences 
Purdue University 
West Lafayette, Indiana, 47907 


Abstract: A collection of parallel processors is said 
to be coordinated if each write from one process- 
ing element (PE) to another is answered by a 
read. We report on an efficient algorithm to test 
coordination for parallel programs in which the 
code for each PE is a loop. We also test a weaker 
predicate for parallel algorithms with oblivious PE 
codes and we show that the general problem is 
PSPACE-hard. 


"Homogeneous parallel algorithms” refers to 
a large class of parallel computations, reminis- 
cent of cellular arrays, formed from "identical" 
processing elements (PEs) that often use pipelin- 
ing and novel interconnection structures. They 
include algorithms for matrix operations [1], 
dynamic programming [2], data base operations 
[3], sorting [4], and signal processing [5]. The 
algorithms in this class are motivated by the 
potential for direct VLSI implementation but they 
are equally well suited for implementation as pro- 
grams for general purpose parallel architectures 
[6]. 

Upon close scrutiny many of these algorithms 
are anything but "homogeneous". The processing 
requirements of PEs may differ because of initiali- 
zation details, termination details, timing and 
edge effects (i.e., special problems encountered 
when a PE is on the perimeter of a processing 
array). There is a benefit in retaining the concep- 
tual simplicity of homogeneity and relegating the 
differences to the status of implementation 
details. This is because, at a high level, many pro- 
cessors are identical and their differences on 
lower levels can largely be inferred from the 
algorithm's global structure. A key goal, then, in 
the effort of simplifying parallel algorithm 
development is: 


To support "homogeneous" simplification by 
automatically generating PE variants when 
possible and to assist in their development 
when manual design is required. 


This work is part of the Blue CHiP Project. It is supported in 
part by the Office of Naval Research Contracts N00014-80-K- 
0816 and N00014-81-K-0360. The latter is Task SRO100. 


0190-3918 /82/0000/0265$00.75 © 1982 IEEE 




We report on progress towards this objective for 
variants that differ in the timing characteristics 
of their interprocess communication. 


We have reported [7] on the automatic syn- 
thesis of PE variants to synchronize interprocess 
communication. We start with a parallel algo- 
rithm which assumes an abstract data flow execu- 
tion mode and for a limited, but widely practical 
class of algorithms, we generate the timing neces- 
sary for synchronous execution. But what if the 
algorithm is not in the class or if manual design is 
required? In this paper, we report on algorithms 
that assist the designer by testing the compatibil- 
ity of interprocess communication. 


A MODEL OF PARALLEL PROGRAMS 


Our abstraction of a homogeneous parallel 
processor is an Interprocessor Communication 
(IC) System. An IC system is given by a set of m 
finite state machines, $M_1, M_2, \ldots, M_m$, each describ- 
ing the input/output behavior of a single PE.† The 
alphabet of the machine is the power set of 


$$\{\, r_{j,\sigma},\ w_{j,\sigma} \mid j \in [m] \wedge \sigma \in \Sigma \,\}$$ †† 


where $\Sigma$ is a finite set of values, $r_{j,\sigma}$ denotes the 
reading of the value $\sigma$ from PE j, $w_{j,\sigma}$ denotes the 
writing of the value $\sigma$ to PE j and $\varnothing$, the empty 
set, represents any other operation not involved 
in interprocessor communication including opera- 
tions that transfer values to and from the exter- 
nal environment. If PE i writes to PE j or PE j 
reads from PE i, we say that there is a communi- 
cation link from i to j. Notice that the intercon- 
nection graph of the processors is implicit in the 
indexes of the symbols. 


We assume that the PEs execute synchro- 
nously and that on each step a PE can execute a 
set of operations simultaneously. Specifically, the 
execution of an IC system is defined by two 
sequences, $c^1, c^2, c^3, \ldots$ and $Q^0, Q^1, Q^2, \ldots$. Each ele- 
ment of the first sequence is an m-vector of 
symbols, one per PE, describing the operations 
executed in a single step. Each element $Q^k$ of the 
second sequence is an m x m matrix of strings, 
where $q^k_{i,j}$ gives the status of the communication 
link from PE i to PE j. The $q^k_{i,j}$ are all of the form 
$\alpha\beta$ with $\alpha \in \Sigma^*$ and $\beta \in (\Sigma^{-1})^*$, where $\Sigma^{-1}$ is the set of 
inverse symbols of $\Sigma$. We interpret $q^k_{i,j} = \alpha\beta$ to 
mean that the symbols $\alpha$ have been written on the 
link but they have not yet been read and the sym- 
bols $\beta$ have been requested from the link but they 
have not yet been written. A symbol $\sigma$ and its 
inverse $\sigma^{-1}$ cancel at the boundary between $\alpha$ and 
$\beta$, i.e. $\sigma\sigma^{-1} = \lambda$, the empty string. 


† IC systems can be defined more generally [8] but for the 
purposes of this paper, we present only a limited version. 


†† [m] denotes the set {1,2,...,m}. 

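In general, for $k \ge 1$ and $i \in [m]$, $c_i^k$ is the k-th 
symbol in a word defined by $M_i$ and 


$$c_i^1 c_i^2 c_i^3 \cdots \in L(M_i).$$ 


Initially, $Q^0$ is empty, i.e., $q^0_{i,j} = \lambda$ for all $i,j \in [m]$. 


Generally, $q^{k+1}_{i,j} = a \cdot q^k_{i,j} \cdot b$ where 

$$a = \begin{cases} \sigma & \text{if } w_{j,\sigma} \in c_i^{k+1} \\ \lambda & \text{otherwise} \end{cases}$$ 


and 


$$b = \begin{cases} \sigma^{-1} & \text{if } r_{i,\sigma} \in c_j^{k+1} \\ \lambda & \text{otherwise} \end{cases}$$ 


and a sequence $Q^0, Q^1, Q^2, \ldots$ is a legal computation 
if and only if 

$$(q^k_{i,j} \in \Sigma^*) \vee (q^k_{i,j} \in (\Sigma^{-1})^*).$$ 


The latter condition enforces our intention that a 
PE reads the same symbol that was written to it in 
the corresponding write. 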
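
The link update and the legality condition can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the token encoding (a symbol paired with an "is inverse" flag) and the function names `step_link` and `is_legal` are assumptions introduced here.

```python
from typing import List, Optional, Tuple

# Assumed encoding: (symbol, is_inverse). ("x", False) stands for a written
# symbol x, and ("x", True) for a requested (inverse) symbol x^-1.
Token = Tuple[str, bool]


def step_link(q: List[Token], written: Optional[str],
              read: Optional[str]) -> List[Token]:
    """Form a . q . b for one step and cancel matching sigma, sigma^-1 pairs."""
    new_q = list(q)
    if written is not None:
        new_q.insert(0, (written, False))   # a = sigma
    if read is not None:
        new_q.append((read, True))          # b = sigma^-1
    # Cancel at the boundary between the written part and the requested part.
    changed = True
    while changed:
        changed = False
        for i in range(len(new_q) - 1):
            (s, s_inv), (t, t_inv) = new_q[i], new_q[i + 1]
            if not s_inv and t_inv and s == t:
                del new_q[i:i + 2]
                changed = True
                break
    return new_q


def is_legal(q: List[Token]) -> bool:
    """A link status is legal if it lies in Sigma* or in (Sigma^-1)*."""
    return all(not inv for _, inv in q) or all(inv for _, inv in q)
```

For example, a write of `x` answered in the same step by a read of `x` leaves the link empty, while a write of `x` met by a read of `y` leaves a mixed string that `is_legal` rejects.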


An IC system is said to be strongly coordi- 
nated if for all i, j, and k, 


$$q^k_{i,j} = \lambda$$ 


that is, during synchronous execution correspond- 
ing reads and writes occur simultaneously.† If we 
allow the writes to precede their corresponding 
reads we say that the system is weakly coordi- 
nated; for all i, j, and k 


$$q^k_{i,j} \in \{\lambda\} \cup \Sigma \;\wedge\; 
\big((k > 0 \wedge q^{k-1}_{i,j} \in \Sigma \wedge q^k_{i,j} = a \cdot q^{k-1}_{i,j} \cdot b) \Rightarrow a = \lambda\big).$$ 


RESULTS 


In this work, we consider algorithms to 
answer the question 


Given an IC system, is it strongly (weakly) 
coordinated? 


† We permit simultaneous reading and writing for technical 
simplicity, but the more conventional unit time delay 
between writing and the subsequent reading requires only 
more complicated, not substantially different, definitions. 
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for three cases of increasingly complex IC struc- 
ture. We develop algorithms for testing coordina- 
tion in the first two cases: loop programs in which 
all PEs repeat a single cycle of operations and 
oblivious programs in which restrictions on legal 
computations are removed. In the third case, 
consisting of general IC systems, we show that 
testing is computationally intractable. 


Loop Programs 


We first restrict our attention to loop pro- 
grams in which each PE executes an initialization 
sequence and then repeatedly executes a single 
cycle of instructions. While this restriction seems 
prohibitive, many highly parallel systems, such as 
the systolic processors [1], can be characterized 
in this way. 

We have developed an algorithm to test strong 
coordination on a single communication link of a 
loop program.†† The algorithm checks the first 
MAX + LCM execution steps of the machines for 
coordination errors, where MAX is the length of 
the longer initialization sequence for the PEs 
involved and LCM is the least common multiple of 
their cycle lengths. This test is sufficient 
because, for k > MAX + LCM, each PE executes the 
same operations in time step k that it does in 
time step (k-MAX) mod LCM. The algorithm, with 
a few modifications, can be used to test weak coor- 
dination as well. In both cases, it requires $O(n^2)$ 
time where n is the maximum number of states in 
the machines involved. 
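
The bounded check described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of the full paper: each PE is modeled as an initialization list followed by a repeating cycle, each step carries at most one value on the link (`None` means no operation on the link), steps are indexed from 0, and the names `op_at` and `strongly_coordinated` are hypothetical.

```python
from math import lcm  # available in Python 3.9 and later
from typing import Optional, Sequence

Op = Optional[str]  # value written or read on the link in one step, or None


def op_at(init: Sequence[Op], cycle: Sequence[Op], k: int) -> Op:
    """Link operation executed at 0-based step k of a loop program."""
    if k < len(init):
        return init[k]
    return cycle[(k - len(init)) % len(cycle)]


def strongly_coordinated(w_init: Sequence[Op], w_cycle: Sequence[Op],
                         r_init: Sequence[Op], r_cycle: Sequence[Op]) -> bool:
    """Check the first MAX + LCM steps of the writer and reader of one link."""
    max_init = max(len(w_init), len(r_init))
    horizon = max_init + lcm(len(w_cycle), len(r_cycle))
    # Strong coordination: every write is matched by a read of the same
    # value in the same step, and no read occurs without a write.
    return all(op_at(w_init, w_cycle, k) == op_at(r_init, r_cycle, k)
               for k in range(horizon))
```

Once both PEs have left their initialization sequences, their joint behavior repeats with period LCM, so no later step can expose an error that the first MAX + LCM steps missed.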


If we assume that a system is composed of a 
small number of distinct PE types which are inter- 
connected in analogous ways, then it is sufficient 
to test each link type just once. For a system 
with t link types, we have 


Theorem 1. The coordination of a system of 
interconnected loop programs can be tested 
in $O(c \cdot t)$ where c is a constant dependent on 
the loop structure of the PE code. 


Notice that the bound is independent of the 
number of PEs and is influenced only by the 
variety of their communication, which would be 
small for "homogeneous" algorithms. 


Oblivious Programs 


Generalizing, we allow arbitrary finite state 
machines but we remove the restriction on legal 
computation sequences. In these oblivious pro- 
grams, it is impossible to branch based on the 


†† The complete details of our algorithms and proofs are 
presented in the full version of this paper [9]. 


values transmitted between PEs. For such sys- 
tems, we can test only worst case coordination, 
answering the question 


Given a communication link, does it have a 
potential coordination error? 


If our algorithm reports NO, then the communica- 
tion link is coordinated; if our algorithm reports 
YES, it is possible that the detected error would 
never occur in any legal computation of the sys- 
tem. 


The algorithm first constructs the "cross pro- 
duct" machine for the two finite state machines 
involved. For strong coordination, the testing 
question is then reduced to a question of state 
reachability in this new machine. The test, there- 
fore, requires O(q) time where q is the number of 
states in the cross product machine; in terms of 
the original machines, the algorithm requires 
$O(n^2)$ time. For weak coordination, we reduce the 
test to a predicate on the computation tree for 
the cross product machine. We show that we can 
determine the value of this predicate in 
$O(q^3) = O(n^6)$ time. For systems with t interface 
types, then, we have 


Theorem 2. The worst case coordination of an 
IC system can be tested in $O(d \cdot t)$ where d is a 
constant dependent on the structure of the 
PE code. 


Again, the result is dependent only on the variety 
of the PEs not necessarily their number. 
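
The reduction of worst case strong coordination to reachability in the cross product machine can be sketched as a breadth-first search. This is an illustrative sketch only: the machine encoding (a dictionary from state to a list of `(link operation, next state)` pairs) and the name `has_potential_error` are assumptions, and the search visits product states and their transitions rather than reproducing the exact algorithm of the full paper.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Assumed encoding: state -> [(value on the link in this step or None, next state)]
Machine = Dict[str, List[Tuple[Optional[str], str]]]


def has_potential_error(writer: Machine, w0: str,
                        reader: Machine, r0: str) -> bool:
    """Return True if some reachable product state allows a mismatched step."""
    seen = {(w0, r0)}
    queue = deque([(w0, r0)])
    while queue:
        ws, rs = queue.popleft()
        for w_op, w_next in writer.get(ws, []):
            for r_op, r_next in reader.get(rs, []):
                if w_op != r_op:          # write and read do not coincide
                    return True
                if (w_next, r_next) not in seen:
                    seen.add((w_next, r_next))
                    queue.append((w_next, r_next))
    return False
```

Because branching is followed without regard to the values transmitted, a `True` answer reports only a potential error, in line with the worst case reading given in the text.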


General IC Systems 


In the most general case, given an IC system 
with arbitrary structure and data dependent 
branching, we show 


Theorem 3: Testing the coordination of arbi- 
trary IC systems is PSPACE-hard [10]. 


The proof of this theorem involves reducing the 
language recognition problem for linear bounded 
automata to our testing question. 


CONCLUSIONS 


Although the complexity theory results indi- 
cate that coordination testing is a very compli- 
cated task, it is important to notice that many 
recently developed parallel algorithms are 
covered by Theorem 1. The testing algorithms 
discussed here are being implemented and we 
expect that they will be of significant assistance in 
the development of parallel programs. 
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MPP VLSI MULTIPROCESSOR INTEGRATED CIRCUIT DESIGN 


John Burkley 
Digital Technology Department 
Goodyear Aerospace Corporation 
Akron, Ohio 44315 


ABSTRACT 

A large scale integrated multiprocessor circuit 
has been developed for use in the Massively 
Parallel Processor system (MPP). The chip, built 
in an HCMOS technology, contains eight bit-serial 
processing elements (PEs) and is the basic 
building block for the MPP processing array. 


INTRODUCTION 
The MPP is a large scale single instruction 
stream, multiple data stream (SIMD) machine being 
built by Goodyear Aerospace Corp. for NASA/GSFC 
(1,2,3). The system block diagram is shown in 
Figure 1. The Array Unit (ARU) processes two 
dimensional arrays of data. Array control is 
generated by the Array Control Unit (ACU) which 
executes the user program and performs any 
sequential processing and scalar arithmetic 
necessary to support array operations. Array data 
I/O is through a special Staging Memory which both 
stores and permutes array data. The Program and 
Data Management unit serves as an external I/O 
preprocessor. 

The ARU makes the MPP special; it incorporates 
16384 PEs organized in a 128 x 128 array and 
operating at a basic cycle of 100 nsec. Each PE 
supports boolean and arithmetic operations, is 
maskable and is capable of routing data to its 
orthogonal neighbors. Table I shows the speed of 
typical operations. 
Fig. 1 - MPP Block Diagram (labeled blocks: memory, 128-bit input 
interface, 128-bit output interface, Program & Data Management Unit 
(PDMU), host computer) 

To build an array of this size and speed required 
the development of a VLSI chip. The chip is 
partitioned 
2 x 4 array, an eight bit bi-directional data port 
with a parity tree and a SUMOR tree, and a disable 
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TABLE I - SPEED OF TYPICAL OPERATIONS 


EXECUTION 
OPERATIONS SPEED 


ADDITION OF ARRAYS 


8-BIT INTEGERS (9-BIT SUM) 
12-BIT INTEGERS (13-BIT SUM) 
32-BIT FLOATING-POINT NUMBERS 


MULTIPLICATION OF ARRAYS 
(ELEMENT-BY-ELEMENT ) 


8-BIT INTEGERS (16-BIT PRODUCT) 
12-BIT INTEGERS (24-BIT PRODUCT) 
32-BIT FLOATING-POINT NUMBERS 


MULTIPLICATION OF ARRAY BY SCALAR 


8-BIT INTEGERS (16-BIT PRODUCT) 
12-BIT INTEGERS (24-BIT PRODUCT) 
32-BIT FLOATING-POINT NUMBERS 


*MILLION OPERATIONS PER SECOND 


circuit capable of disconnecting the chip from its 
east-west neighbors. This last feature 
facilitates automatic repair of the array using 
redundant processing elements. The chip replaces 
some 200 MSI and SSI circuits. The chip will 
execute ten million operations per second when 
operating with high speed RAM (45 nsec access). 
PE memory was not included within the chip for 
several reasons. First, local memory would have 
reduced the number of PEs per chip and 
complicated its design and development. Second, 
the use of external memory allowed the MPP system 
to take full advantage of existing memory 
technology, allowing more memory per PE at a 
faster access time than is possible in HCMOS. 
Finally, future systems could expand PE memory 
without a chip redesign. A total of 2112 chips 
are required to construct an MPP array. This 
total includes a spare column of chips (4 columns 
of PEs) for redundancy. 
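
The chip count follows from the array and chip dimensions given in the text. The sketch below assumes that the 2 x 4 arrangement means 2 rows by 4 columns of PEs per chip, which is consistent with a spare column of chips spanning 4 columns of PEs; the variable names are illustrative.

```python
ARRAY_ROWS = ARRAY_COLS = 128        # PEs along each side of the ARU
CHIP_PE_ROWS, CHIP_PE_COLS = 2, 4    # assumed orientation of the 2 x 4 PE array on a chip

chip_rows = ARRAY_ROWS // CHIP_PE_ROWS   # chips stacked in one column of chips
chip_cols = ARRAY_COLS // CHIP_PE_COLS   # columns of chips across the array

active_chips = chip_rows * chip_cols     # chips needed for the 128 x 128 array
spare_chips = chip_rows                  # one spare column of chips
total_chips = active_chips + spare_chips

print(active_chips, spare_chips, total_chips)
```

With these assumptions the array needs 2048 active chips plus 64 spares, which matches the total of 2112 quoted above.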


CHIP DISABLE 


A chip disable line is provided which logically 
disconnects the chip from the array by disabling 
the SUMOR output and enabling a bypass circuit 
which routes data directly from the west route and 
S register inputs to the east route and S register 
outputs. This logically removes that chip from 
the array allowing column substitution. Since 
only a small portion of chip logic must work for 
the bypass logic to be functional, a failed array 
could be repaired automatically by substituting a 
spare column of chips for a failed column without 
waiting for a maintenance call. 


PE DESIGN 


The PE includes six single bit registers 
(A,B,C,G,P,S), a variable length shift register, a 
full adder and some combinatorial logic. A PE 
logic diagram is shown in Figure 2. The chip is 
controlled by 16 control lines. The PE may be 
divided into four subunits: logic and routing, 
arithmetic, I/O, and masking. These subunits have 
independent control but share a common clock. The 
subunits are interconnected by a bi-directional 
data bus which also connects to external PE 
memory. 


LOGIC & ROUTING SUBUNIT 


The logic and routing subunit is formed by the P 
register together with some supporting 
combinatorial logic. P can be logically combined 
with the state of the data bus and the result is 
stored in P. When routing is enabled, one of four 
inputs to the route multiplexor is selected and 
latched in P. The multiplexor inputs are the 
states of the P registers in the north, south, 
east, and west neighbor PEs. 


ARITHMETIC SUBUNIT 


The arithmetic subunit consists of a serial-by-bit 
adder formed by B and C and a variable length 
shift register whose output may be stored in A. A 
may also be loaded from the data bus. The adder 
receives an input from A and P. When enabled by 
control the adder adds the two input bits to a 
carry bit stored in C and forms a two bit sum. 
The least significant bit is stored in B and the 
most significant bit is stored in C so it becomes 
the carry bit for the next cycle. C may be 
initialized to either a one or a zero. 
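
The serial-by-bit addition described above can be modeled in software. This is a behavioral sketch only, assuming operands are presented least significant bit first, one bit per cycle; the function names and the list-of-bits encoding are illustrative and say nothing about the chip's actual control sequence.

```python
from typing import List, Tuple


def to_bits(value: int, width: int) -> List[int]:
    """Least-significant-bit-first list of bits."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits: List[int]) -> int:
    return sum(bit << i for i, bit in enumerate(bits))


def serial_add(a_bits: List[int], p_bits: List[int],
               carry_in: int = 0) -> Tuple[List[int], int]:
    """Add two bit streams one bit per cycle, as the A/P/B/C registers do."""
    c = carry_in                 # C may be initialized to one or zero
    b_stream = []
    for a, p in zip(a_bits, p_bits):
        two_bit_sum = a + p + c  # the two input bits plus the stored carry
        b = two_bit_sum & 1      # least significant bit goes to B
        c = two_bit_sum >> 1     # most significant bit goes to C (next carry)
        b_stream.append(b)
    return b_stream, c


sum_bits, carry = serial_add(to_bits(200, 8), to_bits(100, 8))
print(from_bits(sum_bits) + (carry << 8))
```

Adding two 8-bit operands this way takes eight cycles and leaves the ninth sum bit in the carry, in keeping with the 9-bit sum listed for 8-bit integers in Table I.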


The arithmetic unit also includes a variable 
length shift register for local storage of partial 
products. This feature significantly improves 
multiply and divide operation times. The shift 
register circulates the output of B back through N 
stages of delay to the adder input register A. The 
length of the shift register, N, may be set in 
steps of 4 from 2 to 30. Since A and B also add 
two stages of delay, the total shift register 
length may vary from 4 to 32. 
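
The selectable lengths can be enumerated directly from the description above. This is a small sketch; the variable names are illustrative.

```python
# N may be set in steps of 4 from 2 to 30.
shift_register_lengths = list(range(2, 31, 4))

# A and B add two more stages of delay to the circulating loop.
total_loop_lengths = [n + 2 for n in shift_register_lengths]

print(shift_register_lengths)
print(total_loop_lengths)
```

Every total length is therefore a multiple of 4 between 4 and 32.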
I/O SUBUNIT 


The I/O subunit is formed by the S register and a 
two input multiplexor which selects input from 
either the data bus or the S register of the PE's 
west neighbor. S register shifting may go on 
independent of other PE operations except when 
data must be stored or loaded from PE memory. 


MASKING SUBUNIT 
The masking subunit is formed by the G register. 
Masking is enabled when G is low. Routing and 
arithmetic operations may be masked separately. 
In addition, the state of P may be outputted to 
the data bus selectively negated by G. This 
allows a masked invert of data in PE memory to be 
executed in two cycles. 

MEMORY INTERFACE 

The PE subunits are interconnected by a 
bi-directional data bus. This bus may be used to 


exchange data between PE registers or to read and 
write PE memory. The control lines allow only one 
bus source at a time. The chip includes an eight 
bit parity tree which generates parity on memory 
write operations and checks parity on memory read 
operations. If bad parity is detected a parity 
error latch is set. Because of the parity tree 
delay, memory operations with parity will operate 
at a 120 nsec cycle. Parity may be ignored for 
10 MHz operation. The eight memory buses are also 
sumored to form a single bit output. 


Fig. 2 - MPP Processing Element Logic Diagram (labeled parts include 
the adder, the arithmetic sub-unit, and the shift register built from 
4-bit, 8-bit, and 16-bit sections) 


Fig. 4 - MPP Chip Topology Drawing 
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CHIP FABRICATION 


The MPP multiprocessor integrated circuit was 
fabricated by Solid State Scientific Inc. in an 
HCMOS technology using 5 µm design rules. The 
design was implemented using about 8000 
transistors and required a chip size of 235 x 131 
mil². The chip requires two power supplies. 
Internal circuitry operates at 7 volts; the output 
translators require 5 volts. The chip is packaged 
in a 52 pin flat pack and dissipates 550 mW when 
operating at 10 megahertz. 


The chip design includes a high speed 
bi-directional data bus. This data bus was 
implemented using NMOS transistor pull-downs and a 
current mirror biased pull-up transistor. This 
bus implementation increased chip power 
dissipation but greatly improved response time. 


The chip photomicrograph is shown in Figure 3; its 
topology drawing is shown in Figure 4. The eight 
PEs are grouped together in the middle of the 
chip in long narrow strips. This was done to 
minimize control line metal runs. The data bus 
and routing logic are grouped toward the top of 
the chip. This logic had the most severe timing 
constraints and was laid out as close to the chip 
pins as possible to minimize line delays. The 
shift registers are grouped along the bottom of 
the chip. The control decode is split and fed in 
from both sides of the chip. 


CONCLUSIONS 


An eight PE multiprocessor chip has been developed 
for use in the MPP. The PE chip design meets all 
the functional and critical timing specifications 
first proposed by Batcher (2) in 1979. 
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EFFICIENT PARALLEL ALGORITHMS FOR PROCESSOR ARRAYS 
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Abstract 


With the advent of VLSI technology, it is 
possible to provide extremely high but inexpensive 
computational capability with a system consisting 
of a large number of identical processors organ- 
ized in a simple, regular structure. In order to 
exploit the high computation capability of the 
arrays, however, it is important to employ an 
efficient parallel algorithm. In this paper a 
measure is proposed which can calculate the effi- 
ciency of an algorithm performed in a processor 
array. This measure is used to compare several 
proposed array architectures for a variety of 
algorithms. Finally, efficient parallel algorithms 
for recursive filtering problems, matrix-vector 
multiplication, and matrix multiplication are also 
proposed. 


1 Introduction 


Problems such as weather prediction, seismic 
data analysis, and signal and image processing 
have to process extremely large amounts of data, 
but even the fastest existing computer cannot 
satisfy these demands [1]. A solution to the need 
for high computational power is the connection of 
a large number of identical processors or process- 
ing elements (PEs). Each PE has limited private 
storage, and in order to not restrict the number 
of PEs placed in an array, each PE is only allowed 
to be connected to some neighboring PEs. Thus, all 
PEs are arranged in a well organized structure 
such as a linear array or two-dimensional array. 
With VLSI technology, the processor arrays can be 
implemented in one chip or in a number of identi- 
cal chips, and the hardware cost increases only 
linearly with the number of processors in the 
array. 


Systems using a large number of PEs include 
the MPP (massively parallel processor) [2], the 
CLIP family [3], and systolic arrays [4,5]. We 
refer to these arrays as processor arrays in this 
paper. They are usually used as peripheral proces- 
sors which perform computation intensive tasks. 
Figure 1 shows a typical processor array employed 
in a computer system. All the data transferred 
between the host system and the processor array 
has to pass through the bus and PEs in the boun- 
dary of the processor array; this may cause some 
of the PEs to be idle at some time. However, the data 
transfer in a processor array can be overlapped 
with computation. 


*This research was supported by the Naval Elec- 
tronics Systems Command under VHSIC contract 
N00039-80-C-0556. 
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Processor 


Fig. 1. A Processor Array as a Peripheral Processor in a General System 


It is therefore important to employ an effi- 
cient parallel algorithm to exploit the high com- 
putation capability of the arrays and reduce their 
idle time. In this paper, the efficiency of 
parallel algorithms for a computation task per- 
formed in a processor array is investigated. A 
measure of efficiency is proposed in section 2. 
It is used to measure the efficiency of some 
parallel algorithms in section 3. Section 4 pro- 
poses new algorithms and processor arrays to 
obtain a better performance for the computation 
tasks which are discussed in section 3. 


2 The Criteria for a Measure of Efficiency 
for Parallel Algorithms 


In this section a measure will be given for 
the efficiency of an algorithm when it performs a 
computation in a processor array. It is obvious 
that the number of required PEs (P) and the tur- 
naround time (T) of the computation are two of 
the factors which affect the efficiency of a 
parallel algorithm. Another factor is the required 
data transfer bandwidth of the algorithm to send 
data into the array; an algorithm requiring a 
large data transfer bandwidth easily saturates the 
bus in Figure 1, and reduces the system perfor- 
mance. The data transfer bandwidth, B, is defined 
as the maximum number of words which have to be 
transferred through the I/O ports of the boundary 
PEs in a time unit (a time unit is defined as the 
period of time a PE performs an operation). 
Furthermore, the importance of the bandwidth is 
more obvious when a processor array is applied in 
a real-time environment with a large volume of 
data. 


Consider a computation task with C operations 
which requires the transfer of I input and output 
operands for its execution, in a processor array 
consisting of P PEs. The turnaround time (T) is 
the time from the beginning of transferring the 
task to a processor array until the result is sent 
back to the host. This time should satisfy equa- 
tion (1) below, since the completion of an opera- 
tion requires one time unit, and up to P opera- 
tions can be done by the processor array in each 
time unit. The turnaround time, T, should also be 
greater than or equal to I/B, the time needed to 
transfer the I operands when at most B words cross 
the array boundary per time unit; this is equa- 
tion (2). 

$$T \ge C/P \qquad (1)$$ 
$$T \ge I/B \qquad (2)$$ 


From (1) and (2) we get 

$$P B T^2 \ge C I. \qquad (3)$$ 

The product of P, B, and $T^2$ is the Space- 
Time-Bandwidth complexity of an algorithm executed 
in a processor array. Equation (3) shows that the 
product CI is the lower bound of the complexity 
$PBT^2$; an optimal algorithm has a value of $PBT^2$ 
approaching the lower bound. 


The ratio of the complexity, $PBT^2$, of an 
algorithm to its lower bound CI, represented as R, 
is a measure of the efficiency of the algorithm. 
In the rest of this paper, the complexity $PBT^2$ and 
ratio R are used to measure the efficiency of an 
algorithm. The lower the value of $PBT^2$, the higher 
the performance of the algorithm in some sense; 
also, R=1 implies that the algorithm is optimal, 
and a large value of R means that the algorithm is 
inefficient. 
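
The measure can be written as a pair of small functions. This is a sketch of the definitions above; the function names and the sample task (1000 operations, 100 operands, 10 PEs, bandwidth 1) are hypothetical and chosen only so that both bounds are met with equality.

```python
def meets_bounds(p: float, b: float, t: float, c: float, i: float) -> bool:
    """Check the two turnaround-time bounds T >= C/P and T >= I/B."""
    return t >= c / p and t >= i / b


def efficiency_ratio(p: float, b: float, t: float, c: float, i: float) -> float:
    """R = P * B * T^2 / (C * I); R = 1 means the lower bound is reached."""
    return (p * b * t * t) / (c * i)


# Hypothetical task: C = 1000 operations and I = 100 operands on P = 10 PEs
# with bandwidth B = 1, finishing in T = 100 time units.
print(meets_bounds(10, 1, 100, 1000, 100))
print(efficiency_ratio(10, 1, 100, 1000, 100))
```

In this made-up case T equals both C/P and I/B, so R is exactly 1; any idle PE time or unused bandwidth pushes R above 1.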

3 Measuring the Efficiency of 
Some Systolic Algorithms 

The systolic array architecture [4,5] pro- 
posed by Kung is a kind of special purpose proces- 
sor array. The systolic algorithms provide well 
organized data flow through the arrays; once a 
piece of data is sent into a systolic array, it 
passes through the array and is fully exploited 
until its associated computations are done. Thus, 
more PEs can be kept busy and the communication 
requests between the array and the host system are 
reduced to a minimum. These are the primary fac- 
tors which realize a high system performance. 


The PE primarily used in systolic arrays is 
an inner product step processor which consists of 
three registers: $R_A$, $R_B$, and $R_C$. These registers 
are used to perform the following multiplication 
and addition in one time unit: $R_C = R_C + R_A \times R_B$. 
Two different geometries of inner product step PEs, 
which Kung defined and called type-A and type-B, 
are shown in Figure 2 (a) and (b). 


Fig. 2 Three Types of Inner Product Step Processors 
(a) Type-A, (b) Type-B, (c) Type-C. 
Each computes $C_{out} = C_{in} + A_{in} \times B_{in}$. 


In the sections 3.1 through 3.3, the effi- 
ciency of the systolic algorithms, for problems 
such as recursive filtering, matrix-vector multi- 
plication and matrix multiplication, is examined. 


The systolic algorithms perform well when process- 
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ing narrow band matrices (i.e. when processing 
band matrices with the widths of the bands much 
less than the dimensions of the matrices) [5]. 
For dense matrix operations the advantages of the 
systolic algorithms decrease. We will examine the 
efficiency of matrix-vector and matrix multiplica- 
tion algorithms in processing both narrow band and 
dense matrices; the boundary between the two cases 
will also be considered. 


3.1 Matrix-Vector Multiplication 

Consider a matrix-vector multiplication 

$$Y = A X \qquad (4)$$ 

where A is an n by n matrix with elements $a_{i,j}$, and X 
and Y are two n-by-1 vectors with elements $x_i$ and 
$y_i$ respectively. The operation can be performed 
as follows: 

$$y_i = \sum_{j=1}^{n} a_{i,j}\, x_j, \qquad 1 \le i \le n \qquad (5)$$ 


Narrow Band Matrix-Vector Multiplication 


When the matrix A in equation (4) is a 
band matrix# with the width of the band W << n, 
the number of computations, C, approaches $Wn$ 
and the number of words, I, approaches 
$(W+2)n$; the lower bound of $PBT^2$ is 

$$CI = (W^2 + 2W)\, n^2.$$ 


The systolic algorithm for matrix-vector mul- 
tiplication in [4,5] required W processors, 
(W/2+1) units of bandwidth, and (2n+W) time units. 
Thus, 

$$PBT^2 = W\,(W/2+1)\,(2n+W)^2 \approx 2\,(W^2+2W)\, n^2.$$ 

The ratio R is about 2. 


Dense Matrix-Vector Multiplication 


If the matrix A in equation (4) is a 
dense matrix, the value of C is $n^2$ and I is $n^2+2n$; 
the lower bound of $PBT^2$ is 

$$CI = n^4 + 2n^3.$$ 

The systolic algorithm for matrix-vector multipli- 
cation in [4,5] required 2n-1 processors, n/2+1 units 
of bandwidth, and 3n time units. Thus, 

$$PBT^2 = (2n-1)\,(n/2+1)\,(3n)^2 \approx 9n^4 + 13.5n^3.$$ 

The value of R is about 9 for large n. All the 
formulas shown in this section are also shown in 
Table 1 (a) and (b) for comparison with the new 
parallel algorithms proposed in section 4. 
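
The two ratios quoted in this section can be evaluated numerically from the processor, bandwidth, and time figures given above. This is a sketch only; the function names are illustrative, and the band case uses the approximations C = Wn and I = (W+2)n stated in the text.

```python
def band_mv_ratio(n: int, w: int) -> float:
    """R for the systolic narrow band matrix-vector algorithm."""
    pbt2 = w * (w / 2 + 1) * (2 * n + w) ** 2
    ci = (w * n) * ((w + 2) * n)
    return pbt2 / ci


def dense_mv_ratio(n: int) -> float:
    """R for the systolic dense matrix-vector algorithm."""
    pbt2 = (2 * n - 1) * (n / 2 + 1) * (3 * n) ** 2
    ci = (n ** 2) * (n ** 2 + 2 * n)
    return pbt2 / ci


for n in (10, 100, 10_000):
    print(n, round(band_mv_ratio(n, 4), 3), round(dense_mv_ratio(n), 3))
```

As n grows with W fixed, the band ratio settles toward 2 and the dense ratio toward 9, in agreement with the values stated in the text.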


3.2 Recursive Filtering Problems 


Another application of the systolic array 
architecture is in evaluating a recurrence equa- 
tion which is used, for example, for recursive 
digital filtering in signal processing problems. 
An m-th order recurrence problem is defined as 

$$x_i = F_i(x_{i-1}, \ldots, x_{i-m}) \quad \text{for } i \ge 1 \qquad (6)$$ 

where $F_i$ is a given recurrence function and $x_i$ is 
calculated from its m predecessors. Assume $x_i$ 
($i \le 0$) is given. 


# A band matrix is a matrix with elements $a_{i,j}$ 
in which $a_{i,j} = 0$ if $j \ge i+p$ or $i \ge j+q$, 
$1 \le i,j \le n$ and $1 \le p,q \le n$; 
the width (W) of the band matrix is (p+q-1). 


In this computation, C is $mn$ and I is $n+m$; 
the lower bound of $PBT^2$ is found as follows: 

$$CI = mn^2 + m^2 n.$$ 


The complexity $PBT^2$ for the algorithm in [4] is 

$$m\,(1)\,(2n)^2 = 4n^2 m.$$ 

The value of R is about 4 for this algorithm. 
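
For reference, an m-th order recurrence of the form (6) can be evaluated sequentially as in the sketch below, which also evaluates the ratio R from the figures above. The function names and the sample recurrence are illustrative assumptions; the initial values are passed as the list [x_0, x_{-1}, ..., x_{-(m-1)}].

```python
from typing import Callable, List, Sequence


def solve_recurrence(f: Callable[[int, Sequence[float]], float],
                     initial: Sequence[float], n: int) -> List[float]:
    """Return [x_1, ..., x_n], where x_i = f(i, [x_{i-1}, ..., x_{i-m}])."""
    preds = list(initial)           # most recent predecessor first
    result = []
    for i in range(1, n + 1):
        x = f(i, preds)
        result.append(x)
        preds = [x] + preds[:-1]    # slide the window of m predecessors
    return result


def filter_ratio(n: int, m: int) -> float:
    """R = 4 n^2 m / (m n^2 + m^2 n) for the systolic algorithm in [4]."""
    return (m * 1 * (2 * n) ** 2) / ((m * n) * (n + m))


# Sample second-order recurrence x_i = x_{i-1} + x_{i-2} with x_0 = 1, x_{-1} = 0.
print(solve_recurrence(lambda i, p: p[0] + p[1], [1, 0], 5))
print(round(filter_ratio(1000, 4), 3))
```

Each x_i depends on the values just computed, which is why the systolic algorithm needs about 2n time units and the ratio stays near 4 when n is much larger than m.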
3.3 Matrix Multiplication 

A matrix multiplication is represented as 

$$C = A B \qquad (7)$$ 

where matrices A, B, and C are n by n band 
matrices with elements $a_{i,j}$, $b_{j,k}$, and $c_{i,k}$ 
respectively. (A dense matrix is a special case 
of a band matrix with a full band width.) Let $W_1$ 
and $W_2$ be the widths of band matrices A and B. 


The operation of equation (7) can be per- 
formed by calculating 

$$c_{i,k} = \sum_{j=1}^{n} a_{i,j}\, b_{j,k} \quad \text{for } 1 \le i,k \le n. \qquad (8)$$ 


Narrow Band Matrix Multiplication 


For a band matrix multiplication represented 
in equation (7) with the condition $W_1, W_2 \ll n$, the 
product CI can be derived as follows: 

$$C \approx W_1 W_2\, n$$ 
$$CI \approx 2\,(W_1 + W_2)\, W_1 W_2\, n^2.$$ 


The complexity $PBT^2$ of the matrix multiplication 
systolic algorithm can be derived as follows: 

$$P = W_1 W_2$$ 
$$T = 3n + M, \qquad M = \min(W_1, W_2)$$ 

therefore, R = 3. 


Dense Matrix Multiplication 


For a dense matrix multiplication, the pro- 
duct CI can be derived as follows: 

$$C = n^3$$ 
$$I = 3n^2.$$ 


The complexity $PBT^2$ of the systolic algorithm in 
[4,5] can be derived as follows: 

$$P = 3n^2$$ (Only $3n^2$ out of $4n^2$ processors 
contribute toward the computation.) 
$$B = 2n$$ 
$$T = 5n$$ 

and $PBT^2 = 150\,n^5$, with R = 50. 


All the equations shown in this section are 
also shown in Table 2 (a) and (b) for comparison 
with other matrix multiplication parallel algo- 
rithms. 

3.4 Remarks


The values of $PBT^2$, CI, and R derived in
sections 3.1, 3.2, and 3.3 suggest that the systolic
algorithms might be improved to obtain lower $PBT^2$
values and better performance.


From [4,5], we know that the systolic algorithms
do not pipe data elements into every processor
in every time unit in order to synchronize
with other data streams when performing the
expected computations. This, however, idles
one-half to two-thirds of the processors at any
given time.


Three straightforward methods can be used to 
overcome this problem. First, independent, func- 
tionally equivalent computations can be inter- 
leaved in the array to obtain high throughput. 
However, functionally equivalent computational 
tasks do not always exist in the system at the 
same time. Another method is to partition a compu- 
tation into independent computations of the same 
size and interleave them in the array. Unfor- 
tunately, not all computations can be partitioned 
into independent computations; also, some overhead 
must be paid for end conditions. Finally, one 
processor element can be used to do the job of two 
(or three) adjacent processors, since the others 
are idle all the time in the existing schemes. 
Although this increases the efficiency of the sys- 
tem, this method still does not exploit all of the 
inherent parallelism in the given computation and 
will require a complicated data transfer and con- 
trol scheme. 


4 New Efficient Parallel Algorithms 
and Processor Arrays 


A data broadcast concept is introduced in 
section 4.1. Incorporating this concept into pro- 
cessor arrays for processing recursive filtering 
problems, matrix-vector multiplication, and dense 
matrix multiplication results in better  perfor- 
mance; these are discussed in sections 4.3, 4.2, 
and 4.4, respectively. In addition, more efficient 
algorithms for matrix multiplications using the 
two-dimensional hexagonal-connected systolic array 
[5] are given in section 4.4. 


4.1 Data Broadcast 


Data broadcast is defined as sending a data 
element to all processors at the same time in a 
multiprocessor system; it can be achieved by con- 
necting a common bus to all processors in the sys- 
tem. Data broadcast to each PE may not make sense 
for many computations. However, for some particu- 
lar computations, it provides a better performance 
and is an alternative approach for reducing the
communication requests between the array and the
host. Sections 4.2 and 4.3 will show that the
processor arrays with data broadcast capability 
(called broadcast processor arrays) provide better 
performance than the systolic arrays in [4,5] when 
they perform the matrix-vector multiplication and 
recursive filtering problems. 


A new inner product processor with data
broadcast capability is shown in Figure 2(c) and
called a type-C processor in this paper. The input
and output of the register $R_x$ of the type-C
processor are connected directly; thus, a data
element loaded into the input of register $R_x$ will be
broadcast to the registers $R_x$ of all the
processors in that row.


Although a driver is required to drive a 
broadcast path, it only takes a limited area [6]. 
Furthermore, the propagation delay of the broad- 
cast data transfer may be larger than that for a 
nonbroadcast array. However, the delay is no 
worse than that of a clock driving a whole chip or 
system, and the data is transferred in parallel 


with the computation (such as with a multiplica- 
tion and an addition in inner product processor 
arrays) which usually takes many cycles. Thus the
delay in data broadcast does not significantly
affect the performance of processor arrays.


4.2 Matrix-Vector Multiplication 


A broadcast array for band matrix-vector mul- 
tiplication is constructed by connecting W type-C 
processors in a row where W is the width of the
band of the matrix. An array with its associated 
data stream is shown in Figure 3(b) for the compu- 
tation in Figure 3(a). 


Fig. 3(a) A Matrix-Vector Multiplication with p=2, q=3


Fig. 3(b) A Broadcast Processor with Data Streams
for the Multiplication in Fig. 3(a)


The algorithm of the computation is reviewed
as follows. All the registers in the array are
initially set to zero. At time j ($j \ge 1$), $x_j$ is
broadcast to all $R_x$ registers, the element
$a_{j+q-k,\,j}$ is loaded into register $R_a$ of the k-th
processor from the right, and the element $y_{j+q-1}$
enters the array from the right; the element
$y_{j+q-1}$ is initially set to zero, and it accumulates
the product of $a$ and $x_j$ in each processor
as it flows to the left. Thus, the matrix A is
loaded into the array column by column and all the
computations associated with $x_j$ are performed in
the time unit j. In the remainder of this paper,
the data flow of matrix A is called a Column-Diagonal
Form (CDF) of the matrix A, since the
matrix elements are processed column by
column and each element is followed by its
diagonal successors ($a_{i+1,\,j+1}$).

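
The data flow described above can be sketched as a small simulation. This is only an illustrative model, not the authors' hardware description: the function name `broadcast_band_matvec`, the list-based pipeline, and the index convention (the k-th cell from the right works on row j+q-k at time j) are assumptions chosen to be consistent with the band definition $a_{ij} = 0$ if $j \ge i+p$ or $i \ge j+q$.

```python
def broadcast_band_matvec(A, x, p, q):
    """Simulate y = A x on a row of W = p + q - 1 broadcast cells.

    A is an n-by-n list of lists that is zero outside the band,
    x is a list of length n. Cells are numbered k = 1..W from the right.
    """
    n = len(x)
    W = p + q - 1
    acc = [0] * W          # acc[k-1]: partial sum held by cell k
    y = [0] * n
    for j in range(1, n + W):              # time units
        for k in range(1, W + 1):
            r = j + q - k                  # row handled by cell k at time j
            if j <= n and 1 <= r <= n:
                # x_j is broadcast to every cell in the same time unit
                acc[k - 1] += A[r - 1][j - 1] * x[j - 1]
        r_out = j + q - W                  # row leaving the leftmost cell
        if 1 <= r_out <= n:
            y[r_out - 1] = acc[W - 1]
        # partial sums move one cell to the left; a zero enters on the right
        acc = [0] + acc[:-1]
    return y
```

Each column of A is consumed in a single time unit, so the loop runs for roughly n + W steps, in line with the time bound quoted for the broadcast array.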
Narrow Band Matrix-Vector Multiplication 


For the matrix-vector multiplication in
equation (4) with $W \ll n$, the algorithm for the
broadcast array requires

$P = W$
$B = W+2$
$T = n+W$

and, thus,

$PBT^2 = W(W+2)(n+W)^2 \approx (W^2+2W) n^2$ for $W \ll n$.

From the value of CI derived in section 3, it can
be seen that the ratio R approaches 1 and the
algorithm is therefore optimal.
All the values of P, B, T, $PBT^2$, and R
required by the algorithms in this section and by 
the algorithm in [4] to perform a narrow band 
matrix-vector multiplication are summarized in 
Table 1 (a). 


Dense Matrix-Vector Multiplication 


For matrix-vector multiplications with $W \to n$,
a data format transform called partial row
translation (PRT) was proposed in [7] to modify the
original systolic data flow for an efficiency
improvement. The broadcast technique in combination
with the PRT technique provides a greater
efficiency when $W \to n$.


4.3 Recursive Filtering Problems 


A linearly connected broadcast array with m
type-C processors and one buffer can be used to
solve the m-th order recurrence problem in
equation (6). In order to illustrate the idea and the
improvement gained by the data broadcast
technique, the example used in [4] is given here as
follows:

$x_i = a \cdot x_{i-1} + b \cdot x_{i-2} + c \cdot x_{i-3} + d$   (9)

where a, b, c, and d are constants.
Figure 4 shows the array structure and the
data streams for this example. Before the
computation starts, the constants a, b, and c are
loaded and stored in registers $R_a$ of each
processor, respectively, and the constant d is stored in
the rightmost processor, for the entire computation.
At the beginning of the time unit i, each
$x_i$ ($i \ge 1$) with an initial value, d, emerges from
the rightmost processor and accumulates its
partial product terms as it passes through the system
to the left. The left side of the linearly
connected array is a buffer. It is used to latch the
final value of $x_i$ at time i ($i > m$) and to broadcast
it back to $R_x$ of all other processors in order to
compute $x_{i+1}$, $x_{i+2}$, and $x_{i+3}$. During the first m
time units, the given values of $x_{-m+1}, \ldots, x_0$
are broadcast in sequence, one in each
time unit.


Fig. 4 A Broadcast Array for the Computation of a
3rd Order Recurrence Problem
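
The feedback loop just described can be sketched in a few lines. This is a hedged software model rather than the paper's circuit: the name `broadcast_recurrence`, the use of `None` for empty cells, and the time numbering (the value broadcast at step t is $x_{t-m}$) are assumptions made for illustration. It covers only the linear constant-coefficient case of equation (9).

```python
def broadcast_recurrence(coeffs, d, init, n):
    """Compute x_1..x_n of x_i = sum_k coeffs[k] * x_{i-1-k} + d.

    coeffs = [a, b, c, ...] (length m); init = [x_{-m+1}, ..., x_0].
    Partial sums flow from the rightmost cell to the left; the finished
    value is latched and broadcast back in the following time unit.
    """
    m = len(coeffs)
    cell_coef = list(reversed(coeffs))   # rightmost cell multiplies x_{i-m}
    partial = [None] * m                 # partial[0] is the rightmost cell
    xs = list(init)                      # xs[t-1] is broadcast at step t
    for t in range(1, n + m):
        v = xs[t - 1]                    # broadcast value x_{t-m}
        partial = [d] + partial[:-1]     # shift left, new sum starts at d
        for k in range(m):
            if partial[k] is not None:
                partial[k] += cell_coef[k] * v
        if partial[m - 1] is not None:   # leftmost cell completed x_{t-m+1}
            xs.append(partial[m - 1])    # latch in the buffer
    return xs[m:]
```

Once the pipeline is full, one new result is appended per loop iteration, which mirrors the claim that a result is piped out every time unit.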


A result is piped out from the buffer every 
time unit instead of every other time unit as in 
the original systolic array. The algorithm in this 
section requires 

$P = m+1$
$B = 1$ (the constants a, b, c, and d are
prestored in the array.)
$T = n+2m$ (including m time units required to set up
the constants a, b, c, and d in serial.)

thus,

$PBT^2 = (m+1)(n+2m)^2 \approx (m+1) n^2$ when $n \gg m$.

When $n \gg m$, which is the usual case in signal
processing problems, the value of R approaches
(m+1)/m; this implies that the algorithm is
optimal in the limit.
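
As a worked check of this limit, assuming the lower bound $CI = mn^2 + m^2 n$ of section 3.2 and $PBT^2 = P \cdot B \cdot T^2$, the ratio can be written out as:

```latex
R = \frac{PBT^2}{CI}
  = \frac{(m+1)\cdot 1 \cdot (n+2m)^2}{mn^2 + m^2 n}
  \approx \frac{(m+1)\,n^2}{m\,n^2}
  = \frac{m+1}{m}
  \qquad (n \gg m).
```

For the third-order example of equation (9) this gives R close to 4/3, compared with about 4 for the algorithm in [4].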


4.4 Matrix Multiplication 


Since the matrix multiplication systolic 
algorithm in [4,5] does not perform efficiently 
for band matrix multiplication with large W, an 
algorithm was proposed in [7] for an ortho- 
connected array, such as for the MPP system, to 
achieve a better performance in processing the 
band matrix multiplication when $W \to n$. However, this
array cannot perform narrow band matrix multipli- 
cation efficiently. 


When we stack n matrix-vector multiplication 
broadcast processor arrays in a broadcast two- 
dimensional array, it can be used to perform 
dense matrix multiplication. Figure 5 shows the 
broadcast two-dimensional array and the data
streams for processing a dense matrix multiplica- 
tion. When the matrix A is broadcast from the left 
side of the array and B is fed into the array from 
the top edge, each processor in the array is used 
to accumulate the partial product terms for an 
element of the matrix C; the final result of the 
matrix C is shifted out after the computation is 
done. This array performs dense matrix multipli- 
cation very efficiently, but it has the same prob- 
lem of inefficient narrow band matrix multiplica- 
tion as the array in [7]. 
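
The accumulation pattern of the broadcast two-dimensional array can be illustrated as follows. This is a sketch under stated assumptions: the function name `broadcast_dense_matmul` is hypothetical, and the exact skew of the B stream in Figure 5 is not modeled; the code only assumes that in time unit j processor (i, k) sees $a_{ij}$ and $b_{jk}$ together.

```python
def broadcast_dense_matmul(A, B):
    """Accumulate C = A B in place, one column of A per time unit.

    Processor (i, k) owns c_ik. In time unit j the element a_ij is
    broadcast along row i and b_jk arrives at column k from the top.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for j in range(n):                       # time units
        col_a = [A[i][j] for i in range(n)]  # broadcast from the left
        row_b = B[j]                         # fed from the top edge
        for i in range(n):
            for k in range(n):
                C[i][k] += col_a[i] * row_b[k]
    return C
```

All $n^2$ processors do useful work in every time unit of this model, which is why the dense case is efficient, while a narrow band would leave most of them holding zeros.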


We therefore modify the original hexagonal-connected
systolic array by reversing the data
flow direction of the result. Our parallel
algorithms performed in this structure provide better
performance for both narrow and dense matrix
multiplications.


Fig. 5 The Data Streams for a Broadcast 2-Dimensional
Array to Perform a Dense Matrix Multiplication


Narrow Band Matrix Multiplication 


In addition to the CDF, we introduce two
more data flow formats called Row-Diagonal Form
(RDF) and Backdiagonal-Diagonal Form (BDF), which
can be used in matrix multiplication algorithms
with a greater performance. When a matrix is
processed in RDF in a systolic array, the elements
are loaded into PEs row by row and followed by
their diagonal successors. The BDF owes its name
to the fact that the backdiagonal of the
processed matrix flows into the array in a line
and each element is followed by its diagonal
successor in the matrix.


For the band matrix multiplication with $W_1 = 3$
and $W_2 = 4$ shown in Figure 6(a), Figure 6(b) shows
the systolic array and data streams flowing in the
directions indicated by the arrows. The matrix A
in RDF is loaded into the array from the left top
boundary, B in CDF from the right top, and C in
BDF from both sides of the top. All the elements
in the bands of matrices A, B, and C move
synchronously through the array in three directions. Each
$c_{ik}$ is initialized to zero as it enters the array
and accumulates all its partial product terms
before it leaves the array through the bottom
boundary. Figures 7(a), 7(b), 7(c), and 7(d) show
four steps of the matrix multiplication in Figure
6.


** A backdiagonal of a matrix consists of
those elements $a_{ij}$ with the property
$\{ a_{ij} \mid i+j = \text{constant for } 1 \le i,j \le n \}$.


Fig. 6(a) Band Matrix Multiplication, W1=3 and W2=4

Fig. 6(b) Data Streams and Hardware Configuration
for the Band Matrix Multiplication in Fig. 6(a)


The complexity $PBT^2$ of this algorithm can be
derived from the data as follows:

$P = W_1 W_2$
$T = n + M$, where $M = \min(W_1, W_2)$

and then

$PBT^2 \approx 2 W_1 W_2 (W_1 + W_2) n^2$ when $n \gg M$
$R = 1$.

The value of R is improved from 3 with the original
algorithm to 1 with the algorithm in this
section. The value of R of the algorithm and array
in [7] is O(n); it is therefore inefficient for
large n.


Dense Matrix Multiplication


For processing band matrices with $W \to n$, the
processors in the two ends of the array process
only a few operations and are idle for the other
times; this causes the algorithms in [4,5] and the
ones in this paper for narrow matrix multiplications
to be inefficient.


We now propose three new data flow formats
($A_{DM}$, $B_{DM}$, and $C_{DM}$)*** constructed from the data
formats -- matrices A in RDF, B in CDF and C in
BDF. Applying these new formats to the hexagonal
connected systolic array provides greater
efficiency improvement for dense matrix multiplication.
In order to describe the data rearrangement
scheme for a multiplication with two band matrices
with $W_1, W_2 \to n$, Figure 6(b) shows that each of the
matrices A, B, and C is divided into three parts
-- main diagonal, upper part, and lower part; they
are represented by the subscripts m, u, and l
respectively.


*** DM represents dense matrix.


Fig. 7 First 4 Steps of the Computation on Fig. 6


The data format $A_{DM}$ of the matrix A is formed
with $A_m$ concatenated by $A_u$ and $A_l$, then by another
copy of $A_m$ as shown in Figure 8. The format $B_{DM}$
is formed with $B_m$ concatenated by $B_u$ and $B_l$, then
by another copy of $B_m$. The format $C_{DM}$ is formed
with the matrix C in BDF concatenated by $C_u$ and
$C_l$. Both $B_{DM}$ and $C_{DM}$ are also shown in Figure 8.


The data in formats $A_{DM}$, $B_{DM}$, and $C_{DM}$ are fed
into the array through the boundary processors at
the top left, top right and at both sides
respectively as shown in Figure 8 to perform a dense
matrix multiplication. Each element in the matrix
C is initially set to zero the first time it is
fed into the array. It passes through n processors
in one or two columns to accumulate its n partial
product terms. From Figure 8, we know only $n^2$
processors and 2n time units are required to
perform an n-by-n dense matrix multiplication.
The values of $PBT^2$ and R of the algorithm can
be derived from the values of P, B, and T as
follows:

$P = n^2$
$B = 6n$
$T = 2n$

and then

$PBT^2 = 24 n^5$
$R = 8$
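
Written out term by term, with $PBT^2 = P \cdot B \cdot T^2$ and the dense lower bound $CI = 3n^5$ quoted with Table 2(b), the two dense-matrix figures compare as follows:

```latex
\text{systolic algorithm of [4,5]:}\quad
PBT^2 = 3n^2 \cdot 2n \cdot (5n)^2 = 150\,n^5,
\qquad R = \frac{150\,n^5}{3\,n^5} = 50

\text{algorithm of this section:}\quad
PBT^2 = n^2 \cdot 6n \cdot (2n)^2 = 24\,n^5,
\qquad R = \frac{24\,n^5}{3\,n^5} = 8
```

The gain comes from using a third as many processors and less than half the time, at the cost of a threefold increase in bandwidth.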


All the values of P, B, T, $PBT^2$, and R for
different algorithms and arrays to perform dense
matrix multiplication are summarized in Table 
2(b). It shows that the broadcast two-dimensional 
array performs most efficiently for dense matrix 
multiplication, but it is not suited for narrow 
band matrix multiplication. The two algorithms 
proposed in this section can be chosen to obtain 
the best performance for matrix multiplication 
with distinct values of W in the same array. 
Using the algorithms in [5], [7], and this paper
to perform matrix multiplications for various
values of W, the values of the ratio R are plotted
in Figure 9. It shows that the algorithms
proposed in this paper provide the highest efficiency
when performing matrix multiplication. They are
also the most efficient for the cases between
narrow-band and dense matrices.


Fig. 8 The Data Streams A
for a 4-by-4 Dense


Conclusion 


The measures $PBT^2$ and R have been proposed
in this paper to evaluate the efficiency of an
algorithm when it is performed on a processor
array. A data broadcast concept was introduced for
processor arrays to obtain better performance for
particular computations. Several parallel
algorithms and processor arrays have been presented
to obtain an efficient performance for the
recursive filtering problem, matrix-vector
multiplication, and matrix multiplication.


Table 2(a) Summary of the values of P, B, T, R, and
$PBT^2$ for band matrix multiplication for $W \ll n$,
with $C \approx W_1 W_2 n$, $I \approx 2(W_1 + W_2) n$.

Column headings: systolic algorithm in [4];
ortho-connected algorithm with PRT [8];
algorithm in section 4.4 for band matrices;
algorithm in section 4.4 for dense matrices;
broadcast two-dimensional processor array algorithm.


Table 2(b) Summary of the values of P, B, T, R, and
$PBT^2$ for dense matrix multiplication,
with $C = n^3$, $I = 3n^2$, and $CI = 3n^5$.


PARALLEL SIMULATION BY MEANS OF A
PRESCHEDULED MIMD-SYSTEM FEATURING
SYNCHRONOUS PIPELINE PROCESSORS


M. Tadjan, R.E. Buehrer, W. Haelg 


ETH (Swiss Federal Institute of Technology) 
Institut für Reaktortechnik
CH-8092 Zurich, Switzerland 


Abstract --- The software package PSCSP (power- 
series continuous simulation program) for the 
simulation of continuous systems being developed 
at ETH achieves a great deal of parallelism by 
making use of the power-series integration tech- 
nique. A parallel version of PSCSP currently runs 
on the ETH-Multiprocessor EMPRESS. 


The new processor scheduling strategy presented in 
this paper was developed in order to further im- 
prove the processing time of this integration meth- 
od. A summary of different performance studies is 
presented, demonstrating that this method is very 
well suited for an MIMD parallel processor
consisting of several synchronous pipeline processors
connected to a powerful EMPRESS-type intercommuni- 
cation memory. A description of such an architec- 
ture is given. 


Introduction 


The software package PSCSP (power-series
continuous simulation program) (*) for the simulation of
continuous systems makes use of the power-series 
integration method. In addition to a fundamental 
high integration speed this technique offers a 
great amount of parallelism ideally suited for an 
implementation in an appropriate MIMD- (multiple- 
instruction stream - multiple-data stream) paral- 
lel processor. Earlier predicted speed improve- 
ments (2) are currently verified on the ETH-Multi- 
processor EMPRESS (5). 


To give a survey of some features of PSCSP which 
are important in this context we first describe 
briefly the method of integrating by means of 
power-series and the parallelization concept used 
in the parallel version of PSCSP. The improved
scheduling strategy is explained afterwards and,
illustrated by a practical example, the
theoretical gain in speed achievable by this technique is
presented. In the final section a parallel
processor consisting of synchronous pipeline 
processors connected by an EMPRESS-type intercommu- 
nication memory (intercom) is described. 


(*) PSCSP, designed and written at ETH (3),
actually exists in two versions: the sequential
version is intended for implementation on a standard
computer such as the PDP-11, the DEC-10, etc., 
while the parallel version is implemented on the 
ETH-Multiprocessor EMPRESS (3,4,5). 


Parallel integration by means of power-series


A system of n first order differential equations
is given as follows:

$y_i'(x) = f_i(x, y_1, \ldots, y_n)$, $i = 1, \ldots, n$   (1)

where the $y_i(x_0)$ are given as initial values.


The solution for $y_i(x)$ at $x = x_0 + h$ using the method
of power-series expansion is

$y_i(x) = \sum_{\nu=0}^{\infty} y_i^{(\nu)}(x_0) \frac{(x - x_0)^\nu}{\nu!}$   (2)

($f_i(x, y_1, \ldots, y_n)$ must be holomorphic in the
interval $(x - x_0)$). In order to proceed numerically, (2)
is separated as follows:

$y_i(x) = \sum_{\nu=0}^{\nu_0} y_i^{(\nu)}(x_0) \frac{h^\nu}{\nu!} + R_i(\nu_0, x)$   (3)


When all $R_i(\nu_0, x)$ satisfy a given convergence
criterion (3) the expansion is terminated and
calculation of the next $y_i(x)$ is started (with $x := x + h$,
$x_0 := x_0 + h$). Otherwise, $\nu_0$ is further increased and an
additional term is added to (3).
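
To make the termination rule concrete, the following sketch applies one power-series step to the scalar test equation y' = y, for which the normalized coefficients obey $Y_{\nu+1} = h\,Y_\nu / (\nu+1)$. The function name `taylor_step`, the tolerance, and the use of the last term's magnitude as a stand-in for the remainder $R_i(\nu_0, x)$ are assumptions for illustration only; PSCSP's actual criterion is the one given in reference (3).

```python
def taylor_step(y0, h, tol=1e-12, max_terms=50):
    """Advance y' = y from x0 to x0 + h by summing Taylor terms.

    term holds Y_v = y^(v)(x0) * h**v / v!; terms are added until the
    newest one is smaller than tol (a simple proxy for the remainder).
    """
    term = y0
    total = y0
    for v in range(max_terms):
        term = term * h / (v + 1)   # Y_{v+1} from Y_v
        total += term
        if abs(term) < tol:
            break
    return total
```

With y0 = 1 and h = 0.1 the sum settles after about ten terms on a value that agrees with exp(0.1) to within the tolerance.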


In (2), (3) it is shown that evaluation of the
function $f_i$ in (1) can be separated into individual
tasks having the nature of simple arithmetic
operations (e.g. r=p+q, r=p×q etc.) or elementary
functions (e.g. r=g(p)). These tasks can be
standardized and stored in a program library. (This
eliminates the need for calculating specific and
complicated recursion formulas for each individual $f_i$;
instead one can define simple recursion formulas
for the tasks mentioned. As a result calculation
of the coefficients in (2) is rather trivial.) In
general, depending on the structure of the problem
to be simulated, several of these tasks are
independent of each other and can therefore be
calculated simultaneously (first stage of parallelism).
As shown in the following example, the recursion
formulas of the recursive tasks additionally show
an inherent parallelism (second stage of
parallelism).


Example

$r = p \cdot q$   (4)

By defining

$r_\nu = r^{(\nu)} \frac{h^\nu}{\nu!}$, $p_\nu = p^{(\nu)} \frac{h^\nu}{\nu!}$, $q_\nu = q^{(\nu)} \frac{h^\nu}{\nu!}$

the corresponding recursion formula reads,

$r_\nu = \sum_{s=0}^{\nu} p_s \cdot q_{\nu-s}$   ($\nu \ge 0$)   (5)

In (3) it is shown that all other recursive tasks
lead to similar recursion formulas. The
calculation of such sums of products can be done quite
easily by means of "recursive doubling". The
critical path length of such schemes is directly
dependent on $\nu$.
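
Formula (5) is the Cauchy product of the two coefficient sequences. A minimal sequential sketch (the name `product_coeffs` is hypothetical, and no recursive doubling is attempted here) is:

```python
def product_coeffs(p, q):
    """Return r_0..r_N with r_v = sum_{s=0}^{v} p[s] * q[v-s].

    p and q hold normalized Taylor coefficients of equal length N + 1.
    """
    n_terms = min(len(p), len(q))
    return [sum(p[s] * q[v - s] for s in range(v + 1))
            for v in range(n_terms)]
```

Each r_v is a sum of v + 1 products, so the work per coefficient grows with v; this is the dependence on $\nu$ that the scheduling strategy in the next section removes from the critical path.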


A sample calculation of component $f_{i\nu}$ of $f_i(x)$ is
illustrated in figure 1. As a consequence the time
for calculating $f_{i\nu}$ is also dependent on $\nu$.


Performance improvement by means of 
a new processor scheduling strategy 


Reduction of the critical path length 


The calculation time for $f_{i\nu}$ (see figure 1) can be
reduced substantially if one can find a scheme
where a complete pass is independent of $\nu$ (i.e.
the critical path length is constant). The
existence of such a strategy is demonstrated below,
using recursion formula (5),

$r_\nu = \sum_{s=0}^{\nu} p_s \cdot q_{\nu-s}$


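The formula can be separated as follows

$r_\nu = p_0 \cdot q_\nu + p_\nu \cdot q_0 + \underbrace{\sum_{s=1}^{\nu-1} p_s \cdot q_{\nu-s}}_{R_\nu}$   (6)

The equivalent separation at $(\nu+1)$ can be written

$r_{\nu+1} = p_0 \cdot q_{\nu+1} + p_{\nu+1} \cdot q_0 + \underbrace{\sum_{s=1}^{\nu} p_s \cdot q_{\nu+1-s}}_{R_{\nu+1}}$   (7)

where

$R_{\nu+1} = p_1 \cdot q_\nu + p_\nu \cdot q_1 + \sum_{s=2}^{\nu-1} p_s \cdot q_{\nu+1-s}$   (8)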
v 


In the standard PSCSP $r_\nu$ is calculated in one pass
and $r_{\nu+1}$ in the next. The fact that $R_{\nu+1}$ of $r_{\nu+1}$
in (7) contains at most the $\nu$'th derivatives of
variables p and q makes it possible to compute
$R_{\nu+1}$ already at the time $r_\nu$ is calculated. Since

$r_{\nu+1} = p_0 \cdot q_{\nu+1} + p_{\nu+1} \cdot q_0 + C_{\nu+1}$, where $C_{\nu+1} = R_{\nu+1}$   (9)

is completely independent of $\nu$, recursive tasks
are reduced to nonlinear ones, provided $R_{\nu+1}$ is
available after $r_\nu$ has been calculated. The
modified graph for the example of figure 1 is
presented in figure 2. The critical path length is
constant.
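
The separation can be illustrated with two small helpers. The names `deferred_sum` and `finish_coeff` are hypothetical, and the split into "ahead of time" and "on the critical path" is an interpretation of formulas (6) to (9), not code taken from PSCSP. The point is that the inner sum for index v only reads coefficients with indices below v, so it can be evaluated while the previous pass is still running.

```python
def deferred_sum(p_low, q_low, v):
    """R_v = sum_{s=1}^{v-1} p_s * q_{v-s}.

    p_low and q_low contain only the coefficients 0..v-1, which shows
    that R_v needs nothing from the current pass.
    """
    return sum(p_low[s] * q_low[v - s] for s in range(1, v))


def finish_coeff(p0, q0, p_v, q_v, r_inner):
    """Constant-length step: r_v = p_0*q_v + p_v*q_0 + R_v."""
    return p0 * q_v + p_v * q0 + r_inner
```

Once `deferred_sum` has been prepared, `finish_coeff` costs two multiplications and two additions regardless of v, which corresponds to the constant critical path length claimed for figure 2.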


Processor scheduling strategy


Given

- a task system $B = (T, <)$ in which $T = (A_1, A_2, \ldots)$
equals a set of tasks and $<$ is the partial
ordering relation.
$A_j < A_i$ implies that $A_i$ cannot start execution
prior to completion of $A_j$.

- a weighting function $\alpha(A_i)$, representing the
execution time $\tau_i = \alpha(A_i)$.

- a fixed number of identical processors, n.

The objective is to find a partition $T_1 \ldots T_n$ of
T such that the largest execution time on any
processor

$t_{max} = \max_{i} \left( \sum_{\tau_j \in T_i} \tau_j \right)$   (10)

is minimized. Considerable attention must be paid
to the development of fast heuristic scheduling
algorithms, yielding suboptimal results (7,8).


Whenever a simulation problem has a deterministic
structure (i.e. if one has a time-invariant problem)
its task system is identical for each integration
step. Consequently, compilation and scheduling
have to be done only once. The appropriate
(scheduling) information for each individual processor
can be computed in advance and loaded into the
corresponding processor memories.


The standard parallel version of PSCSP is able to 
create graphs for deterministic problems as out- 
lined in figure 1. By means of the previously dis- 
cussed method of reducing the critical path length 
graphs like the one shown in figure 2 are obtain- 
able. 


In order to compare both versions in terms of
total execution time $t_{max}$ (10) a model of a parallel
processor was defined and simulated on a PDP-11.
This model consists of a number of synchronously
working pipeline processors and an
intercommunication memory intercom as will be outlined in the
hardware description later on. The arithmetic of
each processor is a dynamic multifunction pipeline
(9), whereby the number of stages can be varied in
the model. The execution times of any operation
(e.g. addition, multiplication etc.) within a task
are variable too, but are assumed to be identical
in this example. Both graphs (figure 1,2) were
scheduled according to the "level algorithm" (8).
As a reference we also determined the ideal
calculation time $t_{ideal}$, given as

$t_{ideal} = \sum_{j} \tau_j$   (11)

(any task $\tau_j$ is part of the critical path).


As expected, execution times referring to the im- 
proved graph turned out to be significantly 
shorter. Results for one of these examples, the 
"restricted three body problem" (3), are presented 
in figure 3. 
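
For readers unfamiliar with level-based scheduling, the sketch below shows a generic list scheduler of that family. It is not claimed to be the exact "level algorithm" of reference (8): the function name `level_schedule`, the dictionary-based task description, and the tie-breaking rule are all assumptions. A task's level is taken as the longest execution-time path from that task to the end of the graph, and ready tasks are placed in order of decreasing level.

```python
def level_schedule(durations, preds, n_proc):
    """Prescheduling sketch: returns ({task: (processor, start)}, makespan).

    durations: dict task -> execution time
    preds:     dict task -> set of tasks that must finish first
    """
    succs = {t: set() for t in durations}
    for t, ps in preds.items():
        for p in ps:
            succs[p].add(t)

    level = {}

    def lvl(t):
        # longest path (in time) from t to an exit task, including t
        if t not in level:
            level[t] = durations[t] + max((lvl(s) for s in succs[t]), default=0)
        return level[t]

    for t in durations:
        lvl(t)

    finish, assignment = {}, {}
    proc_free = [0] * n_proc
    remaining = set(durations)
    while remaining:
        ready = [t for t in remaining
                 if all(p in finish for p in preds.get(t, ()))]
        t = max(ready, key=lambda u: (level[u], str(u)))  # highest level first
        data_ready = max((finish[p] for p in preds.get(t, ())), default=0)
        k = min(range(n_proc), key=lambda i: max(proc_free[i], data_ready))
        start = max(proc_free[k], data_ready)
        finish[t] = start + durations[t]
        proc_free[k] = finish[t]
        assignment[t] = (k, start)
        remaining.remove(t)
    return assignment, max(finish.values())
```

Because the task graph of a time-invariant problem is the same in every integration step, such a schedule would be computed once and the resulting start times loaded into each processor, as described above.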


Hardware description of an 
appropriate multiprocessor 


The multiprocessor described below has been des- 
igned in accordance with the requirements of the 
integration technique just described. Components 
and related functions are presented in table 1. 
The intercom, being a slightly modified version of 
the one installed in the ETH-multiprocessor 
EMPRESS (5), consists of a quadratic organized 
memory matrix whereby an individual processor 
duplicates its data into all elements of its asso- 
ciated row (see figure 4). Reading is possible in 
all elements of its associated column. In addition, 
every execute processor has the facilities to 
write into the supervisor row (in this mode, at a 
specific time slot only one execute processor or 
the supervisor itself gets access to the
corresponding write lines w1). As mentioned earlier,
scheduling of the execute processors is done at 
compilation time in the supervisor processor. As a 
result, prior to the start of the integration part 
the program memories of all execute processors are 
loaded by the supervisor. 


The synchronization of the execute processors is
controlled by a dedicated logic in the supervisor
processor. Note that intermediate results of the
"recursive doubling" and the results $a_i$ (figure 2)
are available in the execute processor region of
the intercom while the results of $f_{i\nu}$ are
transferred at a prescheduled time slot by the
appropriate execute processor to the supervisor row to
be available for further processing.


Conclusions 


The need for efficient algorithms and powerful 
computer hardware is very acute in the field of 
digital simulation. The outlined method of in- 
tegrating differential equations by means of 
power-series in a prescheduled MIMD- pipeline- 
multiprocessor points out possible solutions for 
some of the problems in this field. The relative- 
ly large effort of compilation (including the 
scheduling of processors) is worth-while because 
in many simulation problems one does not often 
have to change the model but only related para- 
meters. Compilation needs to be done only once for 
different runs, allowing an unrestricted profit 
from the fast execution of the integration part. 


Table 1: Functional survey of the multiprocessor components


Component             Functions

Supervisor computer   - I/O activities
                      - program compilation, preparation
                        and execute processor scheduling
                      - control of integration

Execute processors    - execution of arithmetic operations
                      - data provision for further calculations

Intercom              - simultaneous transfer of intermediate
                        or final results

Figure 1: Calculation of $f_{i\nu}$
(legend: non-recursive task, single instr., cpl ≠ f(ν);
recursive task, cpl = f(ν); cpl: critical path length)


Figure 2: Optimized calculation of $f_{i\nu}$, example
(legend as in Figure 1)


Figure 4: Architecture of the Multiprocessor


Pipelining Array Computations for MIMD Parallelism:
A Function Specification. 


by Dennis Gannon 
Department of Computer Sciences, Purdue University 
West Lafayette, Indiana 


Introduction 


This paper describes a formal link between the 
data flow model of MIMD computation and the design 
and analysis of systolic systems. To establish the rela- 
tionship between these two models of computation we 
describe a small set of functional operators which will 
enable us to express many vector and array algo- 
rithms as networks of interacting data-driven 
processes. Using these tools, we will then show that
the data flow graphs of many functions can be
reformulated as "systolic" systems.


The main result of the paper is a theorem which
gives conditions which will guarantee that the systolic
version of the computation graph will perform
asymptotically as fast as a fully concurrent execution of the
original data flow graph.


Vector Valued Data Flow Operators. 


The majority of highly parallel computation is
based on array and vector data structures formed
from primitive scalar types such as the integers Z, the
reals R, booleans B, and complex numbers C. Let
$\{T_i,\ i = 1, \ldots, n\}$ be a set of primitive data types. Define
the direct product type, written as

$\prod_{i=1}^{n} T_i$ or as $T_1 \times T_2 \times \cdots \times T_n$,

to be the set of n-tuples

$(x_1, x_2, \ldots, x_n)$ with $x_i \in T_i$ for $i = 1, \ldots, n$.


More generally we define a domain recursively as
either

1. A primitive scalar type such as Z, R, B, or C.

2. The direct product of a finite set of
domains.


For example, the set of real n-vectors is $\prod_{i=1}^{n} R$
which is denoted by $R^n$ and the set of n by m integer
arrays is $\prod_{i=1}^{n} \prod_{j=1}^{m} Z$. An array of 'records' such that each
record contains an integer, a boolean, and 2 reals
could be described as $\prod_{i=1}^{n} (Z \times B \times R^2)$. The individual
components of a member of some domain will be
addressed by indexing that describes the position of
the component in the structure.


All of the programs constructed below will be 
described as "functions" 


$f : D_1 \to D_2$


from one domain $D_1$ to another $D_2$. More precisely, f
will be a structured set of interacting processes that
collectively define a finite state machine (a function
with memory in the sense described by Ackerman
[Acke82].) The basic components of a function are
simple sequential processes that will be called cell
functions. Each cell function performs a "small" set of
scalar operations on a "small" set of variables. For
example, the expression

    function f(x, y, z: R): R;
        f := y - (x+z);

defines a function $f : R^3 \to R$ which could be
graphically represented as shown in figure 2.1.


Research supported by NSF Grant MCS-8109512.


Figure 2.1 Cell function node 


The inputs to a cell function represent queues of
values. We shall define the execution semantics to be
data-driven, i.e. if at time $t_0$ all input queues to a cell
become nonempty, one value is removed from the
head of each queue and at time $t_0 + 1$ the cell produces
output values.
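
The firing rule can be modeled directly with FIFO queues. The class below is an illustrative sketch, not part of the paper's formalism: the name `Cell`, its attributes, and the `step` method are assumptions. It fires only when every input queue holds at least one value and consumes exactly one value per queue.

```python
from collections import deque


class Cell:
    """A data-driven cell function with one queue per input."""

    def __init__(self, fn, n_inputs):
        self.fn = fn
        self.inputs = [deque() for _ in range(n_inputs)]
        self.output = deque()

    def step(self):
        """Fire once if all input queues are nonempty; report whether it fired."""
        if all(self.inputs):                 # an empty deque is falsy
            args = [q.popleft() for q in self.inputs]
            self.output.append(self.fn(*args))
            return True
        return False
```

For the cell function f := y - (x+z) above, a cell built as `Cell(lambda x, y, z: y - (x + z), 3)` stays idle until x, y, and z have all arrived, then emits one result.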


Cell functions represent small units of sequential 
computation. Explicit parallelism is expressed 
through the application of higher order functional 
operators. Of the many possible classes of operators 
four are described below. 


A. The Product Operator. The simplest form of 
parallelism is the vectored application of a function. 
Given 


$f_i : D_i \to D_i'$, $i = 1, \ldots, n$

The product operator defines a function

$\prod_{i=1}^{n} f_i : \prod_{i=1}^{n} D_i \to \prod_{i=1}^{n} D_i'$

which represents the concurrent operation of the n
functions $f_i$.


B. Permutation and Data Movement Operators. 
Many important computations cannot be specified 
completely without defining certain complex data 
movement operations. 


For example, the Rotation operation executes a 
right circular shift of a product structure. 


\[ \mathrm{rotate}_k(x_1, \ldots, x_n) = (x_{n-k+1}, \ldots, x_n, x_1, \ldots, x_{n-k}) \]


Many other useful permutations can be defined but
they will not be needed here. 
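
As a small illustration of operators A and B, the following Python sketch models a product structure as a tuple. The function names `product` and `rotate` are chosen here for illustration only, and the component functions are applied one after another rather than concurrently.

```python
def product(fs):
    """Return the product of the functions fs: component i is mapped by fs[i]."""
    def applied(xs):
        if len(xs) != len(fs):
            raise ValueError("structure size must match the number of functions")
        return tuple(f(x) for f, x in zip(fs, xs))
    return applied


def rotate(xs, k):
    """Right circular shift of the tuple xs by k positions."""
    n = len(xs)
    if n == 0:
        return xs
    k %= n
    return xs[n - k:] + xs[:n - k]


doubled_and_negated = product([lambda v: 2 * v, lambda v: -v])((3, 4))
shifted = rotate((1, 2, 3, 4, 5), 2)
```

With k = 2 and n = 5 the last two components move to the front, which matches the definition of the rotation given above.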


C. Iterated Composition: The Chain Operator.
Given a function

\[ f : D_x \times D_y \to D_x \times D_z \]

for some structured domains \(D_x\), \(D_y\) and \(D_z\), the chain
operator defines a mechanism to iterate f over the
values in \(D_x\). More specifically, if f is a function
defined with the header

function f(x: D_x; y: D_y): D_x × D_z;

then the iteration

var x: D_x × D_z;
x := initial values;
for i := 1 to n do
    x := f(x, y^(i));

can have at least two interpretations when f is viewed
as a network of interacting cell functions. The simplest
of meanings is given by the chain operator, which constructs
from a function f a sequence of copies of f
where the output of the i-th copy is directed to the
input of the (i+1)-th copy. We denote this by

\[ \mathrm{Ch}_{i=1}^{n} f : D_x \times \prod_{i=1}^{n} D_y \to D_x \times \prod_{i=1}^{n} D_z \]

which is represented by the graph in figure 2.2.
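
Functionally, the chain operator threads the x component through n copies of f while collecting one z output per copy. A minimal Python sketch, assuming f returns a pair (new x, z); the names `chain` and `running_sum` are illustrative and do not come from the paper:

```python
def chain(f, x, ys):
    """Apply f along ys, feeding the x output of copy i to copy i + 1.

    f takes (x, y) and returns (x_next, z). The result is the final x
    together with the list of z values, one per copy of f.
    """
    zs = []
    for y in ys:
        x, z = f(x, y)
        zs.append(z)
    return x, zs


def running_sum(x, y):
    # Example cell: pass the accumulated sum on and also emit it as z.
    total = x + y
    return total, total


final_x, prefix_sums = chain(running_sum, 0, [1, 2, 3])
```

The loop hides the pipelining that the graph form allows: in the network, copy i can start on the next input set as soon as it has passed its outputs to copy i + 1.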


Any composition of the operators and function
constructors described above can be viewed as
defining the data flow graph of a program: graph edges
correspond to the binding of function parameters and
each edge represents a queue of values; graph nodes
correspond to the basic cell functions. Because we
have not specified any operators that may conditionally
select from a subset of input values, any function network
has the property that the graph is acyclic and
the order of arrival of operands to a cell node is determined
by the structure and not the timing of the system.
The latter property shall be referred to as
dependence synchronization and is sufficient to
guarantee that pipelining the system is entirely well
defined. In particular, it implies that the "graph" may
be executed on a data flow machine where only the cell
function address and argument position are needed to
form the "packet address".


Figure 2.2. \(\mathrm{Ch}_{i=1}^{n} f\,(x, y)\)
D. The Systolic Iteration. For a single set of input
\((x, y^1, \ldots, y^n)\) the low utilization of the cell functions in
\(\mathrm{Ch}_{i=1}^{n} f\) represents a large memory (large silicon area)
cost for any hardware implementation of this construct.
The natural alternative is to allow one copy of
f to be used in a "feedback" loop, i.e. some of the outputs
of f are connected to some of its inputs as illustrated
in figure 2.3. While the resulting structure is
not easy to initialize (see [Gann82] for details), it does
permit the cells of f to be reused on each iteration.


Figure 2.3. \(\mathrm{Sy}_{i=1}^{n} f\) graph structure.


As a data flow operation one can show that \(\mathrm{Sy}_{i=1}^{n} f\) is
dependence synchronized if and only if f is dependence
synchronized. On the other hand it is not clear
that the Sy operation can express all the parallelism
that is provided by the chain construction. In fact, it is
not hard to construct examples where \(\mathrm{Ch}_{i=1}^{n} f\) executes
O(n) times faster than \(\mathrm{Sy}_{i=1}^{n} f\). There is, however, a
situation where one can prove that the systolic iteration
exhibits the same parallelism as the chain operation.


Recall that a function is said to be transitive if 
there exists a formal dependence of each component 
of the output on each component of the input. We shall 
say that a function f is weakly transitive if its k-fold 
self-composition (\(f^k\)) is transitive for some k. Weakly
transitive functions occur in computations such as the 
L-U decomposition of a matrix, convolution based 
operations such as the FFT, the solution of partial 
differential equations, the solution of linear 
recurrences, and many graph algorithms such as tran- 
sitive closure. In this case we have the result 


THEOREM. Let f be a weakly transitive function that
executes in constant time. Then

1. The time complexity of \(\mathrm{Ch}_{i=1}^{n} f\) is O(n) and
   \(\mathrm{time}(\mathrm{Sy}_{i=1}^{n} f) = \mathrm{time}(\mathrm{Ch}_{i=1}^{n} f)\).

2. As a data flow graph the edges of \(\mathrm{Sy}_{i=1}^{n} f\) represent
   queues of values. These queues are of bounded
   length where the bound depends only on f and
   not on n.


The proof is given in the report [Gann82]. 


The above paragraphs have stressed Data Flow
Semantics to describe systolic systems. The problem
of transforming the data-driven semantics to a synchronous
set of processors has been considered by
Cuny and Snyder [CuSn6e].


To illustrate the above ideas and constructs, consider
the q-th order linear recurrence relation

\[ x_i = \sum_{j=1}^{q} a_j^i \, x_{i-j}, \quad i = q+1, \ldots, n \]

\[ x_i = c_i, \quad i = 1, \ldots, q \]

where \(x_i\), \(c_i\) and \(a_j^i\), i = 1..n, j = 1..q, are all real
numbers. Programmed in the standard manner shown
below the sequential complexity is roughly 2nq.


for i := q+1 to n do
    x_1^(i) := sum_{j=1}^{q} a_j^i x_j^(i-1);
    x_j^(i) := x_{j-1}^(i-1),  j = 2, ..., q;
    write(x_1^(i));


The superscripts indicate iteration count and are
suppressed in the sequential computation. A cell function
to compute one step of the inner product formed
by the summation is given by


Cfunction ipstep(a, x, s: R): R×R;
    ipstep_1 := s + a*x;
    ipstep_2 := x;


Applying the chain operator over the first parameter
we generate the complete inner product


function IP(x, a: R^q): R^{q+1};
    IP := \(\mathrm{Ch}_{i=1}^{q}\) ipstep(0, a_i, x_i);


which is pictured as the network in figure 3.1.


Figure 3.1 The inner product function. 


The function has been constructed to compute the 
inner product as the first component of the result and 
return the values of x as the remaining q components. 
There is no parallelism in this function other than 
structural: each cell function can act only after its 
right neighbor acts. The complete recurrence is given 
by a second application of the chain operator. 


\[ \mathrm{Ch}_{i=q+1}^{n} IP\,(c_q, c_{q-1}, \ldots, c_1, a^{q+1}, \ldots, a^{n}) \]


which is pictured in figure 3.2 below. 
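
The construction can be checked with a sequential Python sketch. It assumes that the window of the q most recent values is kept most-recent-first, so that coefficient a_j multiplies x_{i-j}, and that the reassociation step simply drops the oldest value. The names `ip`, `recurrence` and the coefficient callback `coeffs` are illustrative and not part of the paper.

```python
def ipstep(a, x, s):
    """One inner-product step: returns the updated sum and passes x through."""
    return s + a * x, x


def ip(xs, a):
    """Chain ipstep over the q pairs (a_j, x_j), starting from s = 0.

    Returns the inner product followed by the q unchanged x values.
    """
    s = 0
    passed = []
    for a_j, x_j in zip(a, xs):
        s, x_out = ipstep(a_j, x_j, s)
        passed.append(x_out)
    return (s, *passed)


def recurrence(c, coeffs, n):
    """Evaluate x_i = sum_j a_j^i x_{i-j} for i = q+1..n with x_i = c_i for i <= q.

    coeffs(i) is assumed to return (a_1^i, ..., a_q^i).
    """
    q = len(c)
    window = tuple(reversed(c))       # (x_q, ..., x_1), most recent first
    xs = list(c)
    for i in range(q + 1, n + 1):
        result = ip(window, coeffs(i))
        window = result[:q]           # keep the new value, drop the oldest
        xs.append(result[0])
    return xs


# With q = 2 and all coefficients equal to 1 this is the Fibonacci recurrence.
fib = recurrence([1, 1], lambda i: (1, 1), 8)
```

Each pass through the loop body corresponds to one copy of IP in the chain, and the slice `result[:q]` plays the role of the reassociation of the output components.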
Figure 3.2. \(\mathrm{Ch}\,IP(c, x)\).


Notice that this application of the chain used the associativity
of the direct product operator to identify
\(R \times R^q\) with \(R^q \times R\). The reassociation of the components
of the output turns IP into a weakly transitive function.
Hence, one may apply the Sy operation as a replacement
for the last chain operation. The result is


\[ \mathrm{Sy}_{i=q+1}^{n} IP\,(c_q, c_{q-1}, \ldots, c_1, a^{q+1}, \ldots, a^{n}) \]


which is illustrated in figure 3.3.


Figure 3.3. \(\mathrm{Sy}\,IP(c, a)\).


This structure is identical to the systolic recurrence
solver of Kung and Leiserson [KuLe80]. (There are
several other derivations of systolic arrays from
formal principles. See for example Kuhn [Kuhn80].)


COMBINING PARTIAL RESULTS IN AN MIMD COMPUTER


Harry F. Jordan 
Department of Electrical Engineering 
University of Colorado 


Boulder, 
Abstract 


One of the most demanding types of
computation in an MIMD computer is one in which
all instruction streams are tightly coupled in
producing a single result. This paper treats
this problem with respect to a shared memory
multiprocessor. Experimental verification of
the analysis is obtained on the HEP computer, a
pipelined multiprocessor. The specific problem
analysed is directly comparable to a previous
analysis of the same problem on a network
computer. A comparison suggests that there is
a strong correspondence between delays due to
conflicting access to a shared memory cell in
the current case and the conflict for use of
communication links in the network computer case.


Introduction 


In any MIMD computer an important type of 
computation is one in which a large number of 
processes contribute to a single result. The 
demands made by such a cooperative computation 
on the data communications and synchronization 
facilities of a parallel architecture are quite 
stringent if good performance is to be achieved. 
The author has previously been involved in a 
study of cooperative computation on a computer 
consisting of a large number of microprocessors 
sharing no memory but having several types of 
high performance communication structures [1]. 
The current paper deals with a true multiprocessor
in which all data memory is shared.
The experimental results presented are from the 
HEP computer [2,3,6], a pipelined, shared 
resource, MIMD machine. 


Summation of Partial Results 


In a multiple instruction stream computer 
numerical algorithms are carried out by multiple 
parallel instruction streams, or processes, 
which share data. A typical short term behavior 
is that N processes run completely independently 
through a computation P(i); i = 1,2,...,N after 
which some partial results V(i) must be combined 
across all processes. A typical form of combina- 
tion is summation, treated in the discussion 
which follows. The discussion applies, however, 
to reduction (in the APL sense [4]) over any 
commutative and associative dyadic operator. 


If we characterize the processes performing 
computations P(i) as producers of the partial 
results V(i) and identify a consumer process 
which uses the sum result R then two methods of 
performing the summation can be identified. The 
first method we call consumer driven because the 
consumer process executes instructions which 
actually perform the individual additions. We 
will use a dollar sign $ preceding a variable 
name to denote the combination of a value with a
full/empty status as is done in the HEP multiple
instruction stream computer [3]. Further we
assume a hardware mechanism, as in HEP, which will
delay the reading of such a variable until it is
full and delay writing it until it is empty.
Reading $V will set its status to empty while
writing it will set the status to full.


A consumer driven summation which runs
efficiently regardless of the order in which
partial results are produced makes use of a
single communications location to pass partial
results to the consumer process. The hardware
synchronization mechanisms prevent a producer
from storing its result until a previously
stored value has been consumed. The programs
for producers and consumers then appear as:


Producer j
    $V := partial result j;


Consumer
    s := $V;
    for K := 2 step 1 until N do
        s := s + $V;
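
The behavior of these two programs can be imitated in software. The sketch below is an assumption-laden model, not HEP code: the class `FullEmpty` emulates a full/empty memory cell with a condition variable, and Python threads stand in for the producer processes.

```python
import threading


class FullEmpty:
    """Software stand-in for a cell with full/empty status (hypothetical helper)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._value = None
        self._full = False

    def read(self):
        # Wait until the cell is full, take the value, and mark the cell empty.
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            self._cond.notify_all()
            return self._value

    def write(self, value):
        # Wait until the cell is empty, store the value, and mark the cell full.
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()


def consumer_driven_sum(partials):
    """The consumer performs every addition; producers only store into $V."""
    if not partials:
        raise ValueError("at least one partial result is required")
    v = FullEmpty()
    producers = [threading.Thread(target=v.write, args=(p,)) for p in partials]
    for t in producers:
        t.start()
    s = v.read()
    for _ in range(2, len(partials) + 1):
        s = s + v.read()
    for t in producers:
        t.join()
    return s
```

Whatever order the threads happen to run in, each producer blocks in `write` until the consumer has emptied the cell, so no partial result is overwritten.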


Realizing that producers are essentially 
idle while waiting for the consumer to empty $V 
leads to the second method of summation which 
we call producer driven because the producers 
perform the actual additions. In this method the 
partial sum (s in the above programs) is shared
by the N producers. This shared partial sum $S
is initialized to zero and a count location
$Count is initialized to N. The producer program
is then:


Producer j
    C := $Count - 1;
    s := $S + V(j);
    if C ≠ 0 then begin $S := s; $Count := C end
    else $R := s;


Most of the code executed by Producer j is
concerned with counting the number of producers
which have contributed to the sum and determining
the last one, and hence completion of the result.
This counting was done by the loop in the
consumer driven method. If completion could be
determined by some other means, then a producer
could execute only: $S := $S + V(j). With the
completion count the consumer need only use the
result $R when it becomes full. It should be
noted that the section of code from the use of
$Count in line one to its filling in line three
forms a critical section which at most one process
j can execute at a time. Since $S appears
only in this critical section it need not have a
full/empty status.
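
A software imitation of the producer driven method is sketched below. As before, the `FullEmpty` class is a hypothetical stand-in for a hardware full/empty cell and threads stand in for processes. Following the remark above, the shared partial sum is kept in an ordinary variable because it is touched only between the read and the refill of the count.

```python
import threading


class FullEmpty:
    """Software stand-in for a cell with full/empty status (hypothetical helper)."""

    def __init__(self, value=None, full=False):
        self._cond = threading.Condition()
        self._value = value
        self._full = full

    def read(self):
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            self._cond.notify_all()
            return self._value

    def write(self, value):
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()


def producer_driven_sum(partials):
    """Each producer adds its own value; the last one fills the result cell."""
    n = len(partials)
    if n == 0:
        raise ValueError("at least one partial result is required")
    count = FullEmpty(n, full=True)   # $Count, initialized to N
    shared = {"s": 0}                 # $S, used only inside the critical section
    result = FullEmpty()              # $R

    def producer(v):
        c = count.read() - 1          # emptying $Count enters the critical section
        s = shared["s"] + v
        if c != 0:
            shared["s"] = s
            count.write(c)            # refilling $Count leaves the critical section
        else:
            result.write(s)           # last producer: publish the total

    threads = [threading.Thread(target=producer, args=(v,)) for v in partials]
    for t in threads:
        t.start()
    total = result.read()
    for t in threads:
        t.join()
    return total
```

The consumer's only synchronization is the single read of the result cell, which blocks until the last producer has filled it.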


Data Conflict/Synchronization Time Analysis 


The time required to complete such a summation
is determined by two influences: the times
required by the subcomputations P(j), which we
will call \(t(P_j)\), and the times spent by producers
waiting to execute the critical section on $Count.
The time required for completion of the result
t($R) is certainly no less than
\(\max\{t(P_j) : j = 1,2,\ldots,N\}\)
and if \(t(P_j)\) has a variance which
is much larger than the time spent by one
producer in the critical section, \(t_c\), then we
expect critical section competition to have a
small effect. The critical section will have its
maximum influence when all \(t(P_j)\) are equal.
In this case the time required to produce the
result will be t($R) = \(t(P_j) + N t_c\).


In general, with fairly tightly synchronized
processes, we expect to be able to say that
\(t(P_j)\) is randomly distributed with a mean which
is much larger than \(t_c\) and a variance which is
larger than \(t_c\) but not larger than \(N t_c\). In
this case the order independence of the summation
algorithm is useful but critical section conflict
also influences the computation time. No
matter what the variance of \(t(P_j)\), this case will
occur for N sufficiently large, assuming that
the variance of \(t(P_j)\) does not increase with N.


Since the time delay due to critical section 
conflict results from the sharing of $Count by 
up to N-1 other processes, it can be reduced by
employing a form of batch adding in which the N
processes are divided into groups of G processes
so that each group forms an independent sum.
These partial sums are then combined G at a time
in the manner of a base G tree until a single
sum results.


A recursive procedure Sum for group summation
is organized as follows. Assume that
\(N = G^K\) terms are to be summed in groups of size
G and that for each level \(\ell\) of the summation
tree, there are \(G^{K-\ell}\) sum variables and corresponding
full/empty count variables. A recursive
procedure to sum a term u into sum number sn at
level \(\ell\) of the tree involves each producer
calling the procedure at level one. The last
producer to contribute to a group sum carries
that sum to the next level of the tree, adding
it to the group sum at that level. One of the
producers, the "last" one, will call Sum once
for each level \(\ell = 1,2,\ldots,K\) and terminate after
filling the result $R. To adapt this procedure
to arbitrary N it is only necessary to set

\[ K = \lceil \log_G N \rceil \]

and to alter the initial values of the count
variables.
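
A rough Python rendering of the group summation follows. It is a sketch under stated assumptions rather than the HEP program: an ordinary lock per group replaces the full/empty count variable, threads replace processes, and the names `group_sum` and `sum_into` are illustrative. For arbitrary N the initial counts are set to the actual number of contributors in each group.

```python
import threading


def group_sum(values, g):
    """Sum values with a base-g tree of group sums, one thread per term."""
    n = len(values)
    if n == 0 or g < 2:
        raise ValueError("need at least one value and a group size of at least 2")

    # widths[l] = number of sum variables at level l; widths[0] = number of terms.
    widths = [n]
    while widths[-1] > 1:
        widths.append(-(-widths[-1] // g))    # ceiling division
    if len(widths) == 1:
        widths.append(1)                      # a single term still needs one level
    k = len(widths) - 1

    sums = {l: [0] * widths[l] for l in range(1, k + 1)}
    counts = {l: [min(g, widths[l - 1] - sn * g) for sn in range(widths[l])]
              for l in range(1, k + 1)}
    locks = {l: [threading.Lock() for _ in range(widths[l])]
             for l in range(1, k + 1)}
    result = []

    def sum_into(u, level, sn):
        with locks[level][sn]:                # critical section for this group only
            sums[level][sn] += u
            counts[level][sn] -= 1
            last = counts[level][sn] == 0
            total = sums[level][sn]
        if last:                              # the last contributor carries the sum up
            if level == k:
                result.append(total)
            else:
                sum_into(total, level + 1, sn // g)

    threads = [threading.Thread(target=sum_into, args=(v, 1, j // g))
               for j, v in enumerate(values)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

At most g threads ever compete for the same lock, which is the point of the grouping: contention per shared cell falls from N to G at the price of K levels.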
A detailed analysis [7] of the time to
complete a summation yields an upper bound of
the form:

t($R) < time for largest partial result
        + time for initial entry to Sum
        + K × time for single level
        + K × (G - 1) × critical section time.


Assuming N to be fixed, it is of interest to
determine an optimal group size G. Only the
last two terms of the bound depend on G and
minimizing the sum of these terms over G leads
to a value of G satisfying \(G(\ln G - 1) = r - 1\)
where r is the ratio of the time for a single
level to the critical section time. Since the
critical section is contained within the code
for a level it is clear that r > 1. The optimal
value of G for several values of the ratio is
shown in Table 1.


r      1        2        3        4
G      2.718    3.590    4.319    4.971


Table 1: Optimum Group Size G

Experimental Results 


The group summation method discussed above
was programmed for the HEP multiple instruction
stream computer using the HEP Parallel FORTRAN
language [5]; a set of 50 producer processes were
created which, for simplicity, summed their
process indices j; j = 1,2,...,50. Thus all
\(t(P_j) \approx 0\) and the worst case conflict situation
obtained. A main program started the 50
producers synchronously and waited for the sum
to become available, timing the length of the
wait. The results for N = 50 and several
values of G are shown in Table 2.


Group Size G    Time to Sum (microseconds)

     2                 759.7
     3                 707.0
     4                 739.4
     5                 793.9
     6                 845.2
     7                 937.0
     8                 941.7
     9                 980.7
    10                1066.4
    25                2004.1
    50                3616.4


Table 2: HEP Parallel FORTRAN Group Summation

The running times show a behavior which is
consistent with a ratio r between one and two.
An assembly code listing of the sum procedure
yielded a ratio of 1.233 by actual instruction
count. One way to look at the results is that
in the best case, G = 3, an addition is being
done every 14.1 microseconds. This point of view
is misleading for three reasons. First, the
procedure is not meant to compete with a
summation done by a single process but is used to
combine results produced by independent processes
set up in parallel for other purposes. Second,
the HEP computer cannot run 50 processes at full
speed. In its current prototype version it will
execute one instruction from each process every
5 microseconds when all processes are active.
Finally, the current (1980) HEP prototype has a
severe indexing restriction which causes indexed
expressions in FORTRAN to produce unreasonable
numbers of machine instructions. To measure the
latter effect, the indices for a group size of
G = 3 were precomputed prior to execution time
and a summation time of 233.3 microseconds was
obtained. This represents a speedup by a factor
of 3 over the execution time indexing version
and compares favorably with the time to sum 50
integers with a single process of 243.2 microseconds.


It should be noted that memory bank conflict 
is not an issue in HEP since memory accesses
are pipelined [3] and a separate analysis [6]
shows that 50 processes are more than sufficient 
to keep pipeline fall-through time from having 
any influence on computation time. Thus the 
above analysis, which relates only to the shared 
memory cells and not to larger blocks of memory, 
is the correct one in this case. 


Communication Delay Versus Access Conflict 


It is interesting to compare the qualitative 
aspects of the current analysis with those of the 
previous analysis of the Finite Element Machine 
(FEM) multi-microprocessor network [1]. In the 
FEM a time multiplexed bus connects all process- 
ors and a set of parallel communications paths 
connect processors with their eight nearest 
neighbors in a planar square array. The group 
summation considered in the present paper 
corresponds most closely to the "distributed 
computation" considered there. To make use of 
the parallel neighbor communication paths a 
group size of nine, corresponding to a processor
and its eight nearest neighbors, was chosen for 
the lowest level. The group size at subsequent 
levels was two, corresponding to a binary tree. 


The "distributed computation" had a fairly 
complex control structure but was still faster 
than the other algorithms studied in spite of
this control overhead. The speed resulted from 
the use of non-shared parallel communications 
paths for most of the information transfer. This 
corresponds very closely to the use of indepen- 
dent group sum locations in the current analysis 
to limit the number of processes competing for 
the same resource (memory cell). The group 
size in the FEM case could not be varied
reasonably since it depended on network 
structure. One of the other algorithms examined, 
however, the centralized algorithm, corresponds 
very closely to the use of a single group in the 
current study. The poor results obtained for 
that algorithm correspond well to those obtained 
for the single group of size 50 reported above. 
There are enough qualitative similarities
in the two analyses to indicate that communications
link conflict in a network computer plays
a role in performance analysis which is quite
analogous to shared memory conflict in a multiprocessor.
In fact, the HEP system with its
pipelined memory access and full/empty memory
cells can be analyzed quite accurately by taking
any cell shared between two processes as a one
word buffered communications link between those
processes.
Abstract


Multitasked asynchronous processes on multiprocessors
are subject to performance degradations
due to the sharing of critical sections.
The concurrent accessing of such critical sections
also results in the familiar lock-out problem.
A general methodology to estimate the performance
degradations of such algorithms on the
processor utilization is presented in this paper.
We study in detail a simple multiprocessor system
with P processes sharing one critical section.
We then generalize our study to a system with an
arbitrary number of critical sections. The approximation
is good for the case in which the
critical sections have low coefficients of variation.
Such an analysis, when applied to the processor
lockout problem, can result in an optimization
of the distribution of the critical sections
in a multiprocessor operating system.

1. Introduction

In order to guarantee the correctness of execution
of multitasked multiprocessor algorithms,
explicit synchronization is often required in
parallel algorithms. The resulting blocked time
is large if the synchronizing processes have significantly
different processing times. The performance
of synchronized iterative parallel algorithms
in multiprocessors has been studied in
[DUB82]. In some cases, the synchronization
points may be removed. A synchronized algorithm
in which all explicit synchronization conditions
have been suppressed becomes an asynchronous algorithm.
The concept of asynchronous algorithms
is derived from the chaotic relaxation scheme investigated
by Chazan and Miranker [CHA69]. Baudet
has determined general convergence conditions
for an asynchronous iterative algorithm [BAU78].
Kung defined the properties of an asynchronous
algorithm and described several examples [KUN76].
An asynchronous algorithm is controlled through a
set of global variables accessible to all
processes. Each process computes independently
(processing phase), reads the global variables,
modifies some of them, then activates a new processing
phase or terminates. Global variables
are usually accessed in critical sections in order
to ensure correctness.

There are also many situations in which a processor
that tries to access a critical section
(C.S.), such as a ready list, is blocked because
the C.S. is being used by another processor. In
this case the processor may spin until the lock
is released. Therefore, the processor which attempts
unsuccessfully to access the C.S. is
locked out. The lockout problem is a direct
result of multiple processors attempting to process
common data structures asynchronously. This
situation resembles the memory conflict problem
in tightly-coupled multiprocessor systems discussed
by so many authors [CHA77]. For the
memory conflict problem, the resources were
hardware resources (memory modules), whereas the
lockout is due to contention for software
resources. There are numerous such shared data
bases in a multiprocessor operating system besides
the ready list. These include memory allocation
table, page allocation table and I/O
lists.

In order to evaluate the efficiency of various
configurations of lockable software resources, we
must consider the effect of processor lockout.
The most significant potential cost arises because
a process that blocks on a spin lock does
not relinquish the processor on which it is executing.
Thus if a process blocks on a lock for a
lengthy period, an important system resource, a
processor, will be lost to the system for the
duration of this period.

Lengthy blocking arises when contention for a
lock becomes too high. To keep contention at an
acceptable level, locks must be used to provide
mutual exclusion only when the grain size is sufficiently
small [JON79]. Grain size is determined
by two factors: the first factor is the
amount of time for which mutual exclusion is
necessary. A short critical section has a smaller
grain size than a long one. The second factor
is how frequently mutual exclusion is needed. A
lock that must be locked often has a longer grain
size than a lock that is touched infrequently.
Locks are basically associated with pieces of
code or data structures. As the number of processors
and processes in the system increases,
the grain size of such locks tends to grow because
they are inevitably accessed more and more
frequently.

In this paper, we introduce an analytical
model based on the central server model to evaluate
the performance of asynchronous processes and
the effect of software lockout upon system performance.
One classical approach to solving the
central server model is to apply the BCMP model
[BAS75] for closed queueing networks. In
[KUM79], an aggregation approximation has been
applied to this model for exponentially distributed
critical sections. However, critical sections
tend to behave more like deterministic
servers, in which case the BCMP model is not very
effective. In the following, we present a simple
approximation to solve the central server model
and to estimate the processor utilization due to
the execution of asynchronous processes. The
precision of the approximation is good only for
critical sections with low coefficients of variation.
This approximation is similar to the one
used by Hoogendoorn to study the performance of
multiprocessor memories [HOO77]. The model is
introduced in its general form.
An implementation of an algorithm on a given
architecture is characterized by a set of performance
features, \(\{f_1, f_2, \ldots, f_N\}\), extracted from
the analytical model. Let F be the feature space
for the given architecture and algorithm. F can
be seen as the product space of the one-dimensional
spaces generated by each feature:

\[ F = \{f_1\} \times \{f_2\} \times \cdots \times \{f_N\}. \]

The topology of the space F is complex. The
feature values may be dense along some coordinate
axes, and discrete along some others. A
performance index for a given architecture is a
real function defined on F by the analytical
model. Local maxima of the index locate operating
points in F where the architecture and the
algorithm implementation are particularly "well-matched"
with respect to the index. The power of
analytical models resides in the estimation of
the impact on the performance of a given feature
or subset of features in isolation. The average
processor utilization, U, defined as the fraction
of time a processor is busy, is used in this paper
as the performance index. The feature-space
approach permits the visualization of the effect
of the performance features on the parameters of
the algorithm and architecture.


A simple multiprocessor architecture is shown
in Fig. 1. A set of P independent processors execute
tasks in a common shared memory through an
interconnection network. This architecture is
called "tightly-coupled" and is typified by the
C.mmp [WUL81]. Such a multiprocessor system can
implement multitasking in which a given algorithm
is decomposed into a set of tasks that run independently
in parallel [FLY72]. When these task
modules communicate intensively, they are each
associated with a processor, under a group
scheduling strategy [JON79], i.e., the processes
are swapped in and out simultaneously, and not
individually. A process is not preempted when it
is blocked at the beginning of a critical section;
rather it "spins" (busy wait) [JON79] or
waits for an interrupt without relinquishing the
processor (in the second case, user hardware interrupts
must be provided or else an operating
system call is made). These strategies are
designed to minimize the overhead and speed up
the algorithm in an environment where the cost
of underutilizing processors is secondary.

One problem which occurs in multiprocessor
systems is memory contention [CHA77]. Generally,
the memory is made of a set of independent
modules. It is interleaved. The instruction cycle
of each processor comprises a variable number
of machine cycles such that at most one memory
reference occurs during a machine cycle. A rejected
request is resubmitted at the next machine
cycle. Under these conditions, a request to the
shared memory can be characterized, in most
cases, by a probability of acceptance \(P_A\), resulting
in a geometrically distributed access time
[PAT81].

In the architecture of Fig. 1, the processors
compete for the shared memory on a word-by-word
basis. This results in performance degradations
due to conflicts in accessing instructions and
data [CHA77]. Let P be the number of processors
and M the number of memory modules. Each processor
references the shared memory with a probability
r during any machine cycle. A widely used
approximation, which is justified by the interleaved
storage pattern, is that the references to
the memory are independent and uniformly distributed
among the M memory modules. This approximation
leads to the probability of acceptance of
a memory request as

\[ P_A = \frac{M}{rP}\left[1 - \left(1 - \frac{r}{M}\right)^{P}\right] \qquad (1) \]

For a derivation of Equation (1), see, for example,
[PAT81].

Formula (1) was derived under the hypothesis
of independent requests. In reality, however, a
rejected request is automatically resubmitted
during the next machine cycle. One correction
was introduced in [DUB81] to take into account
the wasted cycles due to memory conflicts in the
computation of \(P_A\). The behavior of any one process
is described in the Markov graph of Fig. 2.
W is the state corresponding to a wasted cycle
due to memory conflicts, and A is an active cycle,
during which a processor may issue a new request.
Solving for \((\pi_A, \pi_W)\), the stationary probability
distribution of the states, one finds

\[ \pi_A = \frac{P_A}{P_A + r(1 - P_A)}, \qquad \pi_W = 1 - \pi_A \qquad (2) \]

From the graph of Fig. 2, r is defined more
precisely as the probability of referencing the
memory during an active machine cycle. In the
absence of memory conflicts, all machine cycles
are active. Because of the memory conflicts,
memory references are also made during each wasted
cycle. The effective rate of memory access
cycles is thus

\[ r_e = r\,\pi_A + \pi_W = \frac{r}{r + P_A(1 - r)} \qquad (3) \]

and Equation (1) becomes

\[ P_A = \frac{M}{r_e P}\left[1 - \left(1 - \frac{r_e}{M}\right)^{P}\right] \qquad (4) \]

Equations (3) and (4) define an iterative process
by which one can compute \(P_A\) for given M, P and
r.

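
The iteration defined by Equations (3) and (4) can be sketched as a fixed-point loop. The starting value (Equation (1) evaluated at r), the stopping tolerance and the function names below are assumptions made for illustration; the text does not prescribe them.

```python
def acceptance_probability(rate, m, p):
    """Equation (1)/(4): probability that a memory request is accepted."""
    return (m / (rate * p)) * (1.0 - (1.0 - rate / m) ** p)


def corrected_acceptance(r, m, p, tol=1e-12, max_iter=1000):
    """Iterate Equations (3) and (4) until P_A stops changing.

    Returns the pair (P_A, r_e), where r_e is the effective request rate.
    """
    p_a = acceptance_probability(r, m, p)        # start from Equation (1)
    for _ in range(max_iter):
        r_e = r / (r + p_a * (1.0 - r))          # Equation (3)
        new_p_a = acceptance_probability(r_e, m, p)   # Equation (4)
        if abs(new_p_a - p_a) < tol:
            return new_p_a, r_e
        p_a = new_p_a
    raise RuntimeError("iteration did not converge")


p_a, r_e = corrected_acceptance(r=0.5, m=8, p=8)
```

Resubmitted requests raise the effective rate above r, so the corrected acceptance probability comes out lower than the value given by Equation (1) alone. With a single processor there are no conflicts and the iteration returns 1 immediately.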

In a typical asynchronous MIMD algorithm, P 
processes share L critical sections. Outside of 
a critical section, a process can proceed freely. 
However, only one process can be executing a 
given critical section at any given time. Typi- 
cally, the execution of critical sections con- 


sists of updating one or more common variables. 
The fluctuations of their execution times, which 
are mainly due to memory contentions, are often 
small. Data-dependent fluctuations can also be 
present (e.g., conditional modification of a 
shared variable). 

A process in an asynchronous MIMD algorithm
can be seen as a succession of cycles. A cycle
consists of two phases in which a process is in
the non-critical-section phase or in the
critical-section phase. More specifically, a cycle
consists of some processing followed by a request
for a critical section, a possible waiting
time to obtain the right of access, and the execution
of the critical section. This behavior
can be modeled by a closed queueing network with
a population of size P, a P-server node (PN) and
L single-server queues, as shown in Fig. 3. Each
server queue is for a critical section (CS).

In the following sections, an approximation 
for this closed queueing network is presented. 
We begin with the simple case in which L = 1, 
then generalize it to an arbitrary value of L. 
Simulations have shown that the model is adequate 
for a wide range of systems and for CS's with low 
coefficients of variation (say, less than .5). 
The coefficient of variation of the independent 
processing time (CVT) has little influence on the
model prediction. When it is increased beyond 1, 
the approximation deteriorates very slowly. Fi- 
nally, the model is not appropriate for the case 
in which CVT = 0, and the CS's and routing are 
deterministic. These figures are given to indi- 
cate the domain of validity of the model. 

The model shown in Fig. 3 is similar to the 
model for time-sharing systems with L computing 
centers and P users [KLE76]. The terminology
used in the approximation is borrowed from such
systems. The processes are called "jobs." The
independent processing time is the "think time,"
and its mean is denoted by T. The critical sections
are referred to as "servers," and the mean
execution time of each server is S (when L = 1)
or $S_i$ (when L > 1). Further notation will be
introduced in the following discussion.


3. An Approximation to the G/G/1//P Queue 

Two queues of the G/G/1//P class, namely the 
M/G/1//P queues [JAI68] and the D/D/1//P queues
[KIN78], have exact solutions. Based on this
class of queues, we propose a simple approximation
to an algorithm with one critical
section (Fig. 4). There are P jobs, one processor
node (PN) with P servers and one single-server
queue which represents the critical section
(CS). Under the stochastic assumption, each
job has a probability X of being outside of the
CS (or, equivalently, of being in the PN). In
such a state, the job can request access to the
CS. By the ergodic property, X is also the average
fraction of "think time" [KLE76] within each
job cycle. From the value of X, one can derive
the mean properties of the network. For instance,
the mean job cycle time, denoted by C, is
related to X by the formula $X = T/C$. Let

$$I(t) = (i_1(t), i_2(t), \ldots, i_P(t)),$$

with $i_j(t) = 1$ iff job j is not in the server,
and $i_j(t) = 0$ iff job j is in the server at time
t. I(t) is called the indicator vector for the CS at
time t. Each of its components indicates whether
a given job is present in the server or not.


Theorem 1. For any G/G/1//P queue in equilibrium,

$$E[i_1 i_2 \cdots i_P] + \rho X = 1, \qquad (5)$$

where $E[i_1 i_2 \cdots i_P]$ is the expected value of the
product of the components of the indicator vector
and $\rho = P\mu$, with $\mu = S/T$.


Proof: Let $x_s$ be the probability that the server
is busy. Equating the flows of jobs in and out
of the server at equilibrium, we obtain

$$\frac{P X}{T} = \frac{x_s}{S}, \quad \text{or} \quad x_s = \rho X. \qquad (6)$$

On the other hand,

$$\begin{aligned}
x_s &= \mathrm{Prob}[\text{"at least one job is in the server"}] \\
    &= 1 - \mathrm{Prob}[\text{"all the jobs are not in CS"}] \\
    &= 1 - \mathrm{Prob}[i_1 i_2 \cdots i_P = 1] \\
    &= 1 - E[i_1 i_2 \cdots i_P].
\end{aligned} \qquad (7)$$

The last equality results from the fact that
the expected value of a random variable taking
only the values 0 and 1 is equal to the probability
of the variable being 1.

The theorem results from equating (6) and (7). □


Corollary 1.1. (Approximate Model)
If $i_1, i_2, \ldots, i_P$ are independent random variables, then

$$E[i_1 i_2 \cdots i_P] = E[i_1] \cdot E[i_2] \cdots E[i_P] = X^P$$

and

$$X^P + \rho X = 1. \qquad (8)$$
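
A minimal numerical sketch of the approximate model, assuming equation (8) reads $X^P + \rho X = 1$ with $\rho = P\mu$ and $\mu = S/T$. The function name `solve_single_cs` and the use of bisection are illustrative choices, not the authors' method. Since the left-hand side increases monotonically in X, bisection on the interval from 0 to MIN(1, 1/ρ) is enough to locate the root.

```python
def solve_single_cs(P, mu, iters=200):
    """Solve X**P + rho*X = 1 for X, with rho = P*mu (mu = S/T > 0).

    Hypothetical helper: bisection on [0, min(1, 1/rho)], where the
    left-hand side rises monotonically from 0 to at least 1.
    """
    rho = P * mu
    lo, hi = 0.0, min(1.0, 1.0 / rho)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid ** P + rho * mid < 1.0:
            lo = mid      # root lies to the right of mid
        else:
            hi = mid      # root lies at or to the left of mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # With a single job there is no waiting, so X = 1 / (1 + mu).
    print(solve_single_cs(1, 0.25))
    # An intermediate operating point with (P - 1) * mu = 1.
    print(solve_single_cs(64, 1.0 / 63.0))
```

For P = 1 the equation is linear, which gives a quick sanity check on the solver.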


Below, we give three properties of the approx- 
imate model. These properties will be proven in 
a later section for a more general case. 


P1. Equation (8) always has a unique solution
$X_a$ between 0 and $\mathrm{MIN}(1, 1/\rho)$.

P2. For a constant P, $X_a$ "behaves" as the function
$1/(1+\mu)$ when $\mu$ tends to 0.

The approximation $X_a$ obtained from equation
(8) is better when $\rho$ is small. In this case, the
waiting time is small and a job cycles as if it
were the only one present in the network. The
independence assumption is thus valid.

P3. For a constant $\mu$, $\lim_{P \to \infty} \rho X_a = 1$.

In [KLE76], these properties are shown to hold
in the general case, i.e.,

$$X \approx \frac{1}{1+\mu} \quad \text{for } P \ll 1 + \frac{1}{\mu}$$

and

$$X \approx \frac{1}{\rho} \quad \text{for } P \gg 1 + \frac{1}{\mu}.$$
A consequence of property P3 is that the approximation
of equation (8) is still good when
the hypothesis leading to equation (8) (no interference
between jobs) is most violated.
Unfortunately, $X_a$ cannot be a bound for all
systems. It is very easy to prove that the
D/D/1//P queue [KIN78] is such that $X > X_a$ for
all P and $\mu$. On the other hand, it is not difficult
to find examples of M/M/1//P queues with
$X < X_a$.
4. Discussion and Heuristics 


Evaluating $E[i_1 i_2 \cdots i_P]$ is analytically impossible
for most cases. Faced with such complexity,
we resort to extensive simulations. One interesting
theoretical result that should guide
us, however, is given in [PRI76], where it was
shown that, among all M/G/1//P systems, the
M/D/1//P has the largest value of X (and thus the
best performance). Price also showed that when
the coefficient of variation of the server (CVS) 
becomes large, the performance of the M/G/1//P 
queue depends very much on higher moments of the 
server's distribution. 

For values of CVT and of CVS less than 0.5, the
hypothesis of the model is violated because the
job flow is practically deterministic [KIN78],
and the interactions between the jobs in the network
are very large. The approximation performs
best for a short deterministic service time.
Indeed, large instances of the service time are
more likely to result instantaneously in longer
queues and in more interactions between jobs.
Some simulation results are summarized in Tables
1 and 2. In all cases, an offset exponential and
a hyperexponential were used for the cases of a
coefficient of variation less than one and
greater than one, respectively. The model parameters
have been selected such that $(P-1)\mu = 1$.
This case is one of the most difficult to estimate,
since it is an intermediate point for which
the results of properties P2 and P3 do not apply
[KLE76 pp. 208-209]. The relative error in X is
less than 5% in most cases (the errors larger
than 5% are underlined). The approximation worsens
slightly when the number of processors and
the CVT increase. It is not adequate for the
cases when both distributions are either exponential
or deterministic. In such cases, we can use
the M/M/1//P queueing network [JAI68] or D/D/1//P
queueing network [KIN78].


5. Extension to Multiple Critical Sections

We now consider the more general network,
shown in Fig. 3. In this network, a job stays in
the PN for a random "think time" and then
branches to any one critical section, $CS_i$,
$i = 1, \ldots, L$, with a branching probability $p_i$. The
mean processing time of $CS_i$ is $S_i$. Let $I_k(t)$ be
the indicator vector for the k-th critical section:

$$I_k(t) = (i_{k,1}(t), i_{k,2}(t), \ldots, i_{k,P}(t)), \quad \text{for } k = 1, \ldots, L.$$

The definition of each component is the same
as in section 3. Each component $i_{k,j}(t)$ of the
vector indicates whether the job from processor j
is in the k-th critical section or not, at time
t.
Theorem 2. For each critical section,

$$E[i_{k,1} i_{k,2} \cdots i_{k,P}] + \rho_k X = 1, \quad k = 1, \ldots, L, \qquad (9)$$

where $\rho_k = \mu_k P$ and $\mu_k = p_k S_k / T$.
Again, X is the fraction of time spent in the PN.

Proof: The proof proceeds as for theorem 1 and
is omitted here. □
Theorem 3. Within the framework of the job independence
hypothesis leading to equation (8), an
approximate solution for the model of Fig. 3 is

$$X + L - 1 = \sum_{k=1}^{L} (1 - \rho_k X)^{1/P}. \qquad (10)$$

Proof: Given the independence hypothesis, equation
(9) becomes

$$X_k^P + \rho_k X = 1, \qquad (11)$$

where $X_k$ is the fraction of time spent by each
job outside $CS_k$.
On the other hand,

$$\mathrm{Prob}(\text{"job is in PN"}) = 1 - \sum_{k=1}^{L} \mathrm{Prob}(\text{"job is in } CS_k\text{"}),$$

or

$$X = 1 - \sum_{k=1}^{L} (1 - X_k). \qquad (12)$$

Combine equations (11) and (12) to obtain
(10). □

Formula (10) is the approximate model. Note
that at the solution, we must have

$$1 > \rho_k X, \quad k = 1, \ldots, L. \qquad (13)$$

Equation (10) is obviously a generalization of (8).
The existence of a solution to equations (8)
and (10) is established below.


Theorem 4. Equation (10) has a unique real solution,
$X_a$, such that

$$0 < X_a < \mathrm{MIN}\left(1, \frac{1}{\rho_{max}}\right), \quad \text{where } \rho_{max} = \underset{k=1,\ldots,L}{\mathrm{MAX}} \{\rho_k\}.$$

Proof: We give a graphical proof. We initially
assume that $\rho_{max} < 1$. When X increases from 0 to
1, the L.H.S. of equation (10) increases monotonically
from (L-1) to L, whereas the R.H.S. decreases
monotonically from L to
$\sum_{k=1}^{L} (1 - \rho_k)^{1/P} < L$.
There must be an intersection point for a
value of X between 0 and 1.

The alternative is $\rho_{max} \ge 1$. In this case,
when X increases from 0 to $1/\rho_{max}$, the L.H.S. of
equation (10) increases monotonically from
(L-1) to $1/\rho_{max} + L - 1$, while the R.H.S. decreases
monotonically from L to

$$\sum_{k=1}^{L} \left(1 - \frac{\rho_k}{\rho_{max}}\right)^{1/P} \le (L-1).$$

Again, an intersection point must exist for X
between 0 and $1/\rho_{max}$. □


Now that the existence of a unique solution is 
proved, one can find it by iterative methods or 
by graphical methods. 
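
One possible iterative method, sketched under the assumption that equation (10) reads $X + L - 1 = \sum_k (1 - \rho_k X)^{1/P}$ with $\rho_k = P\mu_k$. The function name `solve_multi_cs` and the choice of bisection are illustrative, not taken from the paper. The sketch relies on the monotonicity argument of the graphical proof: L.H.S. minus R.H.S. increases in X over the interval from 0 to MIN(1, 1/ρ_max).

```python
def solve_multi_cs(P, mus, iters=200):
    """Solve X + L - 1 = sum_k (1 - rho_k*X)**(1/P), with rho_k = P*mu_k.

    Hypothetical helper: `mus` lists mu_k = p_k*S_k/T (all positive) for
    the L critical sections. Bisection is applied to g(X) = LHS - RHS.
    """
    rhos = [P * m for m in mus]
    L = len(rhos)

    def g(x):
        # max(0, ...) guards against tiny negative values from rounding
        # when x sits at the upper end of the interval, 1/rho_max.
        rhs = sum(max(0.0, 1.0 - rho * x) ** (1.0 / P) for rho in rhos)
        return x + L - 1 - rhs

    lo, hi = 0.0, min(1.0, 1.0 / max(rhos))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Sixteen identical sections, uniform branching, total traffic 0.5.
    print(solve_multi_cs(64, [0.5 / 16] * 16))
```

With L = 1 the routine reduces to the single-section equation (8), which is a convenient consistency check.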

Let $\mu_{max} = \mathrm{MAX}_k \{\mu_k\}$ and $\rho_{max} = P\mu_{max}$. The
following two theorems illustrate the asymptotic
behavior of $X_a$.


Theorem 5.

$$X_a = \frac{1}{1 + \sum_{i=1}^{L} \mu_i} + O(\rho_{max}^2). \qquad (14)$$

Because $\rho_i X < 1$, $i = 1, \ldots, L$, a first order approximation
of equation (10) yields

$$X + (L-1) = L - \sum_{i=1}^{L} X \mu_i + O(\rho_{max}^2),$$

from which the claim can be easily derived.

For a constant P, the first term of equation
(14) dominates when $\mu_{max}$ tends to 0. This first
term is also the value of X obtained by neglecting
the waiting time at each queue [KLE76].
Theorem 6. For constant $\mu_i$, $i = 1, \ldots, L$,

$$\lim_{P \to \infty} \rho_{max} X_a = 1. \qquad (15)$$

Proof: As $X + L - 1 \ge L (1 - \rho_{max} X)^{1/P}$ and
$\rho_{max} X < 1$, we have, for all P,

$$1 > \rho_{max} X \ge 1 - \left(\frac{X + L - 1}{L}\right)^{P}.$$

If $S_i \ne 0$, then $X < 1$, and $(X + L - 1)/L < 1$, so that
the claim is proved. □


Note that $X = 1/\rho_{max}$ is the asymptotic value obtained
when one server becomes a bottleneck
[KLE76]. Theorems 5 and 6 show that the approximation
of equation (10) is correct for asymptotic
cases. It can be expected to be a good approximation
for intermediate values of the parameters,
i.e., values such that

$$P \approx \frac{1 + \sum_{i=1}^{L} \mu_i}{\mu_{max}} \quad \text{[KLE76 pp. 220-221].} \qquad (16)$$
Max 
The results for a uniform and a_ triangular 


branching probability distribution are shown in 
Table 3. The approximate model is in agreement 
with the simulations performed for various possible
values of $CVS_i$ and CVT. Note that the approximation
is good for the case when $CVS_i = CVT = 0$


because of the random routing, which destroys the 
correlation between jobs, exhibited, for example, 
in the D/D/1//P system. The maximum possible 
value for X is 0.5, because $S_i = T$.


6. The Processor Utilization 

The average processor utilization, denoted U, 
is used to evaluate the degree of matching 
between an architecture and an asynchronous algo- 
rithm, as modeled in sections 3 through 5. Be- 
sides asynchronous algorithms, the model can also 
be applied to evaluate processor lockout in the
context of the multiprocessor operating system. 
In both cases, a processor is busy while it is 
outside and inside of the critical sections. 
Idleness is caused by blocking at the entrance to 
the critical sections and by memory conflicts. 

For the G/G/1//P case, the processor utilization
U can be found as follows. The total time
during which a processor is busy within each cycle
is

$$\alpha (T + S), \qquad (17)$$

where T and S are the mean "think" time and the
mean service time respectively. $\alpha$ is the coefficient
which accounts for memory conflicts. In
general, a fraction r of machine cycles contains
a reference to the memory, and each memory request
is accepted with a probability $P_a$. If the
request is not accepted, it is resubmitted during
the next machine cycle. Under these conditions,
we have shown that (see equation (2))

$$\alpha = \frac{P_a}{r + P_a (1 - r)}, \qquad (18)$$

where $P_a$ is given by equations (3) and (4) if we
assume spin locks. On the other hand, the time
taken by each cycle is C. The average processor
utilization is thus

$$U = \frac{\alpha (T + S)}{C} = \alpha X \left(1 + \frac{S}{T}\right) = \alpha X \left(1 + \frac{S_0}{T_0}\right), \qquad (19)$$

where $S_0$ ($= \alpha S$) and $T_0$ ($= \alpha T$) are the mean
time to execute CS and the mean think time in the
absence of memory conflicts, respectively.

This formula can be generalized to the case of
L critical sections, provided that X is defined
as the fraction of time spent in the processor
node per cycle through the network, and S is replaced
by $\sum_{i=1}^{L} p_i S_i$. In equation (19), $(1 + S_0/T_0) X$
represents the penalty due to the synchronization
at the critical section.

The relative error in the estimation of X is
matched by a similar added error in the estimation
of U. Whereas the estimation of X from the
model is reliable, the estimation of R, the mean
response time, from X can introduce an unacceptable
error in R. To show this, we recall that
$X = T/(T+R)$, where T is the mean think time and R is
the mean response time. Let $\epsilon_X$ and $\epsilon_R$ be the
relative errors in X and R, respectively. Then

$$\epsilon_X \approx \frac{\Delta X}{X} \approx \frac{dX}{dR} \cdot \frac{\Delta R}{X} = -\frac{R}{T+R} \cdot \frac{\Delta R}{R} = -(1 - X) \frac{\Delta R}{R}.$$

Hence, $\epsilon_R = -\epsilon_X / (1 - X)$.

It can be seen that if $\epsilon_X \ne 0$, then $\epsilon_R$ can be
very large as X tends to 1.

However, it can be easily seen that the relative
error in C is equal to that in X, because
$X = T/C$. Therefore, the cycle time can be estimated
with the same relative error as that in X. However,
even for a small error in X, the error in R
can be large.
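
As a hedged numerical illustration (the values below are chosen for this sketch and do not come from the simulations), take X = 0.95 and suppose the model overestimates X by 2%. To first order, the relation between the relative errors in X and R gives

$$|\epsilon_R| \approx \frac{|\epsilon_X|}{1 - X} = \frac{0.02}{1 - 0.95} = 0.40,$$

so a 2% error in X becomes an error of roughly 40% in R, whereas the error in the cycle time $C = T/X$ stays close to 2%.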

As an example, we analyze a system with P independent
and identically distributed processes
sharing L identical critical sections. We denote
the ratio $S_0/T_0$ by $\xi_0$. By equation (19),

$$U = \alpha X (1 + \xi_0).$$

X is the solution of equation (10):

$$X = \frac{L}{P \xi_0} \left[1 - \left(\frac{X + L - 1}{L}\right)^{P}\right].$$

$\alpha$ is obtained as

$$\alpha = \frac{P_a}{r + P_a (1 - r)},$$

with

$$P_a = \frac{M}{r_e P} \left[1 - \left(1 - \frac{r_e}{M}\right)^{P}\right], \quad \text{and} \quad r_e = \frac{r}{r + P_a (1 - r)}.$$

Recall that M is the number of memory modules
and is considered fixed. P is the number of processors
participating in the algorithm execution,
and its maximum value is M. The performance
features are P, r, L, $\xi_0$. These features generate
a 4-dimensional space of which two plane
cuts are displayed in Fig. 5. The cuts illustrate
the combined effects of the contention for
the critical sections and the memory modules when
P = 64, M = 64, and L = 16 or 64. As L goes from
16 to 64, $\xi_0$ (the critical-section to think-time
ratio in the absence of memory conflicts) becomes
the dominant feature over r (the probability of a
memory reference per active machine cycle) in the
typical operating region (r greater than 0.6).
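
The example can be evaluated numerically along the following lines. This is a sketch under stated assumptions: the memory-conflict relations are taken to be $P_a = \frac{M}{r_e P}[1 - (1 - r_e/M)^P]$ and $r_e = r/(r + P_a(1-r))$, which is a reading of the garbled source rather than a confirmed form, and the critical-section ratio is written `xi0`. The function name `utilization`, the fixed-point loop and the bisection are illustrative choices, not the authors' procedure.

```python
def utilization(P, M, L, r, xi0, tol=1e-13):
    """Approximate utilization U = alpha * X * (1 + xi0).

    Assumed inputs: P processors, M memory modules, L identical critical
    sections, r = probability of a memory reference per active cycle
    (0 < r <= 1), xi0 = S0/T0 > 0.
    """
    # Memory conflicts: iterate r_e -> P_a -> r_e until a fixed point.
    r_e = r
    p_a = 1.0
    for _ in range(100_000):
        p_a = (M / (r_e * P)) * (1.0 - (1.0 - r_e / M) ** P)
        r_new = r / (r + p_a * (1.0 - r))
        converged = abs(r_new - r_e) < tol
        r_e = r_new
        if converged:
            break
    alpha = p_a / (r + p_a * (1.0 - r))

    # Critical sections: equation (10) with rho_k = P * xi0 / L for all k.
    rho = P * xi0 / L

    def g(x):
        return x + L - 1 - L * max(0.0, 1.0 - rho * x) ** (1.0 / P)

    lo, hi = 0.0, min(1.0, 1.0 / rho)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    return alpha * x * (1.0 + xi0)


if __name__ == "__main__":
    for L in (16, 64):
        print(L, utilization(P=64, M=64, L=L, r=0.6, xi0=0.5))
```

A single process (P = 1) never waits for memory or for a critical section, so the sketch should return U = 1 in that case.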


7. Conclusion

A simple approximation to estimate the proces- 
sor utilization in asynchronous MIMD algorithms 
has been presented in this paper. The model as- 
sumes that the time taken by the execution of a 
critical section is deterministic or has a low
coefficient of variation. 

The validation of an approximation with such a 
broad applicability requires extensive simula- 
tions. Only selected results have been reported 
here. The approximation has been compared to the 
simulation results for the G/G/1//P (Tables 1 and 
2) and the more general system of Fig. 3 (Table 
3). Note that the model does not include the
service discipline at the queues. However, the 
simulations were run for a FCFS (First-Come- 
First-Served) policy. To appreciate the quality 
of the approximation from Tables 1 and 2, one 
should keep in mind that the most difficult cases 
to estimate are the ones corresponding to inter- 
mediate values of the parameters, as defined by 
equation (16). 

Such simple approximate models have great im- 
portance in the understanding of multiprocessor 
program behavior. They permit software designers 
to compare alternative implementations and to es- 
timate to which degree a given algorithm is fit 
to be executed on a tightly coupled system. For 
example, one interesting property of the model is 
that it depends only on the number of processors, 
critical sections and their total traffic $\xi_0$.
This implies that, within the limits of validity
of the model, the speed-up is equally affected by 
the wait on short but frequent critical sections 
and the wait on long but infrequent critical sec- 
tions provided that the total traffic is the 
same. However, short critical sections require 
more additional instructions to open and close 
the critical sections.


Figure 1. Tightly-coupled MIMD processor

Figure 3. System with L critical sections shared by P processes


Figure 2. Markov graph for computing se 


Figure 5. Feature planes for asynchronous algorithms
in a tightly coupled system. Axes: critical-section
to think-time ratio in the absence of memory conflicts, $\xi_0$,
and probability of a memory reference per active machine
cycle, r. Curves: L = 16 (plain), L = 64 (dotted).


THE AUTOMATED DESIGN OF TASK-SPECIFIC
PARALLEL PROCESSING ARCHITECTURES


Matthew O. Ward


1. Introduction


Interest in parallel processing has mainly 
stemmed from a requirement to perform vast amounts 
of computation at high speed on sometimes large 
quantities of data. For example, processing of 
video information in real-time for applications 
such as robotics requires analysis to be performed 
on data arriving at rates as high as 15 million 
picture points per second. Most systems to date 
which are capable of this speed are built almost 
entirely of special purpose hardware, which is 
expensive, time consuming to design and develop, 
and often restrictive in its applications. More 
flexible systems, such as SIMD machines, can per- 
form tasks very quickly but are restricted as to 
the complexity of the tasks they can perform as 
well as being often difficult to program. 


The research presented here is an attempt to 
overcome some of these problems. The goal is to 
take a set of algorithms which one wishes to 
execute very frequently, and automatically design 
an MIMD machine capable of executing the tasks at 
a desired speed. In the case of robotics these 
tasks will include image preprocessing, feature 
extraction, object classification, arm guidance 
and monitoring, and accessing and updating the 
knowledge base of the environment in which the 
robot works.


In this and many other environments designing 
systems around algorithms is a reasonable approach, 
due to the frequency of execution for each algorithm
and the importance of high speed. Granted,
this methodology could not be used to develop a
general purpose system, but it is generally agreed
that no single computer system is capable of
satisfying all computing needs without exorbitant
cost and underutilization. A major restriction is
that once a system is designed the set of algorithms
to be run on it is fixed, although it may
be conjectured that algorithms in the same 
restricted environment may show enough similarity 
in data and control flow characteristics to allow 
fitting new algorithms to the architecture. 


2. System Components 


The basic subtasks involved in the research 
approach presented here are as follows. 


2.1 Extraction of Parallel Processable Tasks from Sequential Programs


In a given program there are two types of
parallelism which one can detect and utilize at the
statement level without significantly modifying the
original code. The first is noting when pairs of
statements are mutually data independent [2], i.e.,
one is not reading from a variable which the other
is attempting to write into. These are termed
statement independent. The second type is recognizing
when one iteration of a loop is unaffected
by the progress of another, i.e., iteration
ordering is irrelevant. This is termed iteration
independency. Thus, by comparing all pairs of
statements as well as analyzing all loops one can
learn for each occurrence of each statement in a
program's execution the earliest time it can be
executed (firing condition) as well as the latest
point at which its results are needed by other
statements (reset condition). The definition and
format of these terms was introduced by Dervisoglu
[9], although the extraction method used in that
work did not always produce correct results.


Two conditions must be true for a statement
to execute properly: control flow must indicate
that the statement is to execute at some time
to ensure correct results, and valid data must be
available to use in calculations. These conditions
together constitute a firing condition, which may
be represented simply by a list of all statements 
which must execute prior to the firing of the 
given statement. Since multiple control paths 
may exist for each statement there may exist sev- 
eral possible conditions, of which at most one may 
be true at a given time. Thus, a sum-of-products 
representation is used, with the product terms 
being the individual statements and the sums being 
the separation of distinct paths. For example, in 
the following section of code statement 5 cannot 
execute until either the true branch of 2 fires or 
statement 4 executes. Statement 2 is a control 
component while all others are data components. 
Note the reduction by precedence rules. 


    x = n + y            (1)
    if (x < m)           (2)
       m = m/y           (3)
    else x = x/y         (4)
    r = x + n            (5)

    p(5) = 1*2t + 1*2f*4
         = 2t + 4
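
The sum-of-products form lends itself to a direct evaluation. The sketch below is one hypothetical encoding, not the representation used in the system described here: each product term is a set of event labels, where "2t" and "2f" stand for the true and false branches of statement 2, and a condition holds as soon as every label of some term has occurred. The names `can_fire`, `cond_5` and `cond_5_unreduced` are assumptions made for illustration.

```python
def can_fire(condition, completed):
    """Return True if at least one product term is fully satisfied.

    condition: list of product terms, each a set of event labels.
    completed: set of labels for events that have already occurred.
    """
    return any(term <= completed for term in condition)


# Reduced firing condition for statement 5: 2t + 4.
cond_5 = [{"2t"}, {"4"}]
# The same condition before reduction by precedence: 1*2t + 1*2f*4.
cond_5_unreduced = [{"1", "2t"}, {"1", "2f", "4"}]

if __name__ == "__main__":
    print(can_fire(cond_5, {"1", "2t"}))        # true branch taken
    print(can_fire(cond_5, {"1", "2f"}))        # statement 4 not yet done
    print(can_fire(cond_5, {"1", "2f", "4"}))   # false branch completed
```

On any execution history that respects program order, the reduced and unreduced forms agree, which is what the reduction by precedence rules relies on.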


Once the parallelism is extracted a simulation
is performed of the parallel execution of the
program. This is done to ensure that sufficient
savings in execution time are possible, thus
meriting a continued effort at designing a corresponding
hardware architecture. This is an
important stage as it has been found that many
algorithms do not lend themselves to significant 
parallelization, either due to the form of the 
particular implementation of the algorithm or the 
nature of the algorithm itself. 


2.2 Processor Allocation and Communications Requirements


As a first step towards designing architec- 
tures to fit particular algorithms the system 
must attempt to determine the minimum number of 
processors needed to take advantage of all of the 
available parallelism while at the same time rea- 
sonably minimizing the amount of interprocessor 
communications, as this is the main bottleneck of 
any well-balanced parallel processing environment. 


Although it can be readily shown that the optimization
of either of these problems is NP-complete,
satisfactory results can often be derived using
partially analytic and partially heuristic-guided
construction in a bounded amount of time.


Maximum parallelism can be easily ensured by
assigning tasks to processors such that no two
tasks on the same processor will ever be ready for
execution at the same time. This is directly
derivable from the firing conditions by noting
that tasks which have data dependencies or control
conflicts may reside on the same processor. Some
methods used for reducing the number of processors
as well as interprocessor communications include
assigning processes in order of decreasing interprocess
communications as estimated by approximating
loop counts, and limited lookahead in evaluating
more globally the cost incurred in assigning
a process to various processors.
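
A greedy version of this assignment can be sketched as follows. It is an illustration under assumptions, not the allocation procedure of the system described here: the function `allocate`, the `concurrent` set (pairs of tasks that may be ready at the same time) and the `comm` traffic estimates are all hypothetical, and the lookahead step is omitted. Tasks are placed in order of decreasing estimated communication, each on the compatible processor with which it exchanges the most traffic.

```python
def allocate(tasks, concurrent, comm):
    """Greedily assign tasks to processors.

    tasks: list of task names.
    concurrent: set of frozenset pairs that may be ready simultaneously
        and therefore must not share a processor.
    comm: dict mapping frozenset pairs to estimated traffic.
    Returns a list of sets, one per processor.
    """
    def traffic(task):
        return sum(v for pair, v in comm.items() if task in pair)

    processors = []
    for task in sorted(tasks, key=traffic, reverse=True):
        best, best_gain = None, -1
        for proc in processors:
            if any(frozenset((task, other)) in concurrent for other in proc):
                continue  # would lose parallelism on this processor
            gain = sum(comm.get(frozenset((task, other)), 0) for other in proc)
            if gain > best_gain:
                best, best_gain = proc, gain
        if best is None:
            processors.append({task})  # open a new processor
        else:
            best.add(task)             # traffic to co-located tasks becomes local
    return processors


if __name__ == "__main__":
    tasks = ["a", "b", "c", "d"]
    concurrent = {frozenset(("a", "b"))}
    comm = {frozenset(("a", "c")): 5,
            frozenset(("b", "d")): 3,
            frozenset(("a", "d")): 1}
    print(allocate(tasks, concurrent, comm))
```

Being greedy, the sketch offers no optimality guarantee; it only preserves the constraint that concurrently ready tasks never share a processor.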


At this point, knowing the processes which 
will reside on each processor, estimates of both 
processing and storage requirements of each proc- 
essor and some general information concerning 
interprocessor communications will be known. This 
information is useful both in avoiding excessive 
system cost by specifying minimum component 
requirements and also helping to decide how to 
group components of similar requirements in the 
event that cost or component constraints require
"collapsing" of the resulting architecture.


2.3 Architectural Specifications Based on Functional and Communications Requirements


Given the functionality (processing and data 
and control communications) requirements of the 
algorithms in their parallel form, an architecture 
must be designed with the appropriate functional 
capabilities. The previous two sections have outlined
the extraction and grouping of the processing
and communications characteristics required,
and this information is now used in conjunction
with an architectural components knowledge base to
design one or several hardware configurations capable
of executing the algorithm in parallel.


Some of the information included in this
knowledge base concerns the computing capabilities
of processing elements, size and addressing
means of memory, and bandwidth and control
protocol of links. Perhaps the most important
information is that of interfacing specifications
for each device, thus avoiding the design of
'impossible' hardware architectures.


The procedure then is to first locally match
processor requirements with processor capabilities
and interprocessor traffic with link bandwidth,
and then refine selection using compatibility
relationships, working outward until a
totally defined, compatible (able to be interfaced)
system is produced. Obviously provisions
must be made to resolve deadlocks in the procedure,
i.e., when there exist no alternatives
which allow components to be linked. This often
will entail decreases in speed or increases in
cost, which will be user specified.


2.4 Compiling Parallel Processable Tasks Into Architecture Dependent Executable Form


Once an architecture has been designed and
constructed the tasks assigned to each processor
must be converted to an executable form, including
message passing protocol, firing and resetting
expression evaluation, and, of course, the
program code itself. The simple operating
system required on each processor to perform these
tasks has several advantages over those on existing
distributed systems. Firstly, the ordering
of operations is totally deterministic in that
all essential orderings are preserved by the
parallelization process. In addition, communication
is less of a problem than in general purpose
systems, as more is known of the interactions
between processors which will take place. Thus,
it is possible to do much pre-execution anticipatory
work.


The basic series of tasks to be executed on 
each processor of the system will be as follows. 


a. Receive a message concerning the firing or
   resetting of a statement.

b. Evaluate firing and resetting conditions for
   all statements awaiting this message on the
   processor.

c. If a firing condition is true then

   c1. Gather required data for execution.

   c2. Execute the statement.

   c3. Send messages and possibly resulting
       values to all processors which are
       awaiting its completion.

d. If a reset condition is true then

   d1. Reset the state for that statement
       so the firing condition is again
       evaluated for future reexecution.

   d2. Send messages to all processors
       which are awaiting the resetting.


As many of these tasks will be identical in
form for each task and processor, a hardware implementation
of many of these components is logical,
especially in communications and expression evalua- 
tion. This is important, as these tend to be the 
major bottlenecks in parallel processing systems. 

3. Current Status

At the time of writing a significant percent- 
age of the system has been designed, implemented, 
and tested. The work described in Sections 2.1 
and 2.2 has been completed, accepting as input 
normal programs written in a large subset of C and 
producing firing and resetting conditions as well 
as processor assignments. The knowledge base has 
been designed and a skeleton for the entering and 
querying of information has been completed. A 
study is underway to determine the significant 
attributes needed to describe architectural com- 
ponents to use in the automated design process. 
Likewise, an algorithm for creating hardware archi- 
tectures using the knowledge base and the algorithm 
requirements has been designed, the implementation 
of which will be completed when the knowledge base 
is available. A system for compiling the tasks 
into executable modules for a test bed of Motorola 
68000 microcomputers has been completed. Other pro- 
cessors can be easily incorporated with a suitable 
cross-compiler and a small number of processor- 
specific I/O routines. Finally, a communications 
processor is being designed to reduce losses in 
speed due to communications and condition 
evaluation. 


The simulated parallel execution of several 
algorithms has been compared to corresponding 
sequential execution to check for both equivalence 
in results and estimated speedup. Processor 
assignment for a number of short programs (< 40
lines) has been checked against optimal assign- 
ment located by analytic (exhaustive search) 
methods with highly satisfactory results. Assign- 
ments made for larger programs, although difficult 
to thoroughly assess, have been fairly satisfac- 
tory, although it can be seen that additional 
heuristics may be beneficial. 

4. Conclusions

A general description has been presented of a 
methodology for automatically designing special- 
purpose parallel processing architectures given 
the tasks which are to be performed. Results to 
date have been quite encouraging as to the effec- 
tiveness of the technique. Obviously, the method 
would be relatively useless in designing general- 
purpose systems, unless a set of representative 
algorithms could be devised which would cover a 
nearly complete spectrum of program types. It is 
believed that this is not possible, agreeing with 
the idea that no single architecture could ever 
satisfy all possible user needs. A major deter- 
minant in the effectiveness of the method was the 
actual implementation of the algorithms used as 
input. It was observed that minor changes in the 
implementation could result in major increases in 
parallelism and reduction of interprocessor communications.
Thus, as work progresses a set of
rules is being developed as a guide for writing
programs to best exploit parallelism, several of 
which will be implemented into a precompiler to 
relieve the user of the need to modify his or her 
programming style.


A BIT-SEQUENTIAL MULTI-OPERAND INNER PRODUCT PROCESSOR


H.J. Sips


Delft University of Technology 


Abstract- A bit-sequential multi-operand inner
product processor is described with O(2N.n) complexity,
where N is the number of product pairs and
n is the wordlength of the operands. The operands
have variable precision. The result is available
d clock cycles after the absorption of the last
bit of the operands. The parameter d is a small
positive constant.


I. The algorithm 


The summation of product terms is necessary in a
large number of scientific computations such as
matrix manipulations, signal processing, etc. The
inner product function is defined as:

$$P = \sum_{i=1}^{N} A_i B_i. \qquad (1)$$


To achieve a high computing speed most two operand 
sequential processors transport their data and 
variables in a bit-parallel manner. If instead of 
a sequential processor a parallel processor ap- 
proach is taken this results in the use of many 
bit-parallel buses. This imposes severe restrictions
on the constructability of a parallel computer
system. These restrictions do not disappear
in VLSI. As a means to overcome computational and
interconnection complexity the use of bit-sequential
processing can be considered. Swartzlander et
al. [SWAR78] considered a quasi-serial implementation
of the product terms using a parallel counter
multiplier.

Because bit-sequential processing intrinsically slows
down the computation it is very important to
use the data transport lines effectively, i.e., the
bits on the in- and output operand lines should be
significant on each time step. On-line and semi on-line
algorithms have this property. On-line algorithms
are defined by the property that to generate
the j-th digit of the result it is necessary and
sufficient to have the operands available up to the
(j+d)-th digit, where d is a small positive constant.
On-line algorithms with the least significant
digit first have been developed among others
by Atrubin [ATRU65] and Chen [CHEN79]. Trivedi et
al. [TRIV77] have developed algorithms for the most
significant digit first using a redundant number
system. On-line algorithms are only efficient if
long expressions have to be evaluated. If, however,
a recurrent equation is solved the result must be
delayed by (n-d) time steps.

Semi on-line (SOL) algorithms are defined [SIPb82]
by the property that d clock cycles after the absorption
of the last digits of the operands the
first digit of the result is available. The parameter
d is now a small positive constant which is
independent of the word size n and for which holds:
d(s)<d(p). d(s) is the semi on-line delay and d(p)
is the delay in a full bit-parallel implementation
of the arithmetical operation. In the multi-operand
case we can further demand that the hardware
is of O(M.n) complexity where M is the number of
operands and n is the word size. The effective delay
of the semi on-line algorithm is (n+d) cycles.
The operation (1) is evaluated in (n+d) clock
cycles using semi on-line algorithms. The condition
d(s)<d(p) implies processing of the operands
during transportation. Semi on-line algorithms
can also be defined with the most significant
digit first or with the least significant digit
first.

The number A is represented by the bit string
{a(n)a(n-1)...a(1)} where a(1) is the bit that is
entered into the processor first. If A is a sign-magnitude
number, bit a(1) is always the sign bit.
If A is a two's complement number the bits proceed
in normal order. The algorithm is based on the
technique of incremental multiplication [TRIV77]
[CHEN79]. The algorithm is extended for multi-operand
operations and two's complement numbers.


Let:

$$A_j = \{a_j(n)\,a_j(n-1) \ldots a_j(1)\}$$

$$A_j(k) = \{a_j(k)\,a_j(k-1) \ldots a_j(1)\} \qquad (2)$$

$A_j(k)$ is $A_j$ up to the k-th bit. From (2) follows:

$$A_j(k) = A_j(k-1) + a_j(k) \cdot 2^{t} \qquad (3)$$

t=-k+1 and k>1 for MSB-SM (a(1)=sign bit)
t=k-2 and k>1 for LSB-SM (a(1)=sign bit)
t=k-1 and k>1 for LSB-TC
t=-k and k>1 for MSB-TC

$A_j(0)=0$ for TC; $A_j(1)=0$ for SM

It then follows that:

$$A_j(k) \cdot B_j(k) = A_j(k-1) \cdot B_j(k-1) + A_j(k-1) \cdot b_j(k) \cdot 2^{t} + B_j(k) \cdot a_j(k) \cdot 2^{t} \qquad (4)$$

Suppose:

$$P_j(k) = A_j(k) \cdot B_j(k) \cdot 2^{-t} \qquad (5)$$

$P_j(0)=0$ for TC; $P_j(1)=0$ for SM


From this the following recurrent equation can be 
derived; 


$$P_j(k) = 2^{-s} \cdot P_j(k-1) + A_j(k-1) \cdot b_j(k) + B_j(k) \cdot a_j(k) \qquad (6)$$


s=-1 for the MSB first algorithms 
s= 1 for the LSB first algorithms 


This can be done for every product in equation (1); 


$$P(k) = 2^{-s} \cdot P(k-1) + \sum_{j=1}^{N} \{A_j(k-1) \cdot b_j(k) + B_j(k) \cdot a_j(k)\} \qquad (7)$$


Each recursion step the evaluation of (7) requires 
the full addition of 2N+1 operands. 

The generation of the partial product for sign-magnitude
numbers is straightforward according to
(6) since the magnitudes can be interpreted as
positive numbers. The signs of $A_j$ and $B_j$ determine
whether the partial product $P_j(k)$ is weighted as a
positive or a negative number in (7).

A two's complement number can be expressed as:

$$A = -a(n) \cdot 2^{n-1} + \sum_{i=1}^{n-1} a(i) \cdot 2^{i-1} \qquad (8)$$

(LSB first case) which means that all the bits
besides bit a(n) can be treated as if they were
positive. The bit a(n) gives a negative weight to
B(n) in equation (7).
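
The recurrence (7) can be checked numerically. The following is a minimal software sketch, assuming the LSB first two's complement case (s = 1, t = k-1) and exact rational arithmetic in place of the shifted hardware accumulator. The function names are illustrative and do not come from the paper. The sign bit is given negative weight as in (8).

```python
from fractions import Fraction

def bits_lsb_first(x, n):
    """Return the n two's complement bits of x, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]

def inner_product_lsb_tc(pairs, n):
    """Evaluate sum(A_j * B_j) with recurrence (7), LSB first, two's complement.

    pairs: list of (A_j, B_j) integers that each fit in n bits.
    """
    a_bits = [bits_lsb_first(a, n) for a, _ in pairs]
    b_bits = [bits_lsb_first(b, n) for _, b in pairs]
    A = [0] * len(pairs)          # A_j(k-1): operand value absorbed so far
    B = [0] * len(pairs)          # B_j(k-1)
    P = Fraction(0)               # P(0) = 0 for TC
    for k in range(1, n + 1):
        weight = 1 << (k - 1)     # 2^t with t = k-1
        sign = -1 if k == n else 1  # the last bit entered is the sign bit
        step_sum = 0
        for j in range(len(pairs)):
            a = sign * a_bits[j][k - 1]
            b = sign * b_bits[j][k - 1]
            B_new = B[j] + b * weight              # B_j(k)
            step_sum += A[j] * b + B_new * a       # A_j(k-1).b_j(k) + B_j(k).a_j(k)
            A[j] += a * weight                     # A_j(k)
            B[j] = B_new
        P = P / 2 + step_sum                       # 2^(-s).P(k-1) + sum, s = 1
    # P(n) = (sum A_j.B_j) . 2^(-(n-1)), so undo the scaling at the end.
    return int(P * (1 << (n - 1)))

print(inner_product_lsb_tc([(-18, 10), (14, 11)], 6))
```

The operands in the call are those of the numerical example in figure 4 (-18 * 10 + 14 * 11).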


II. Implementation 


To achieve enough speed in solving (7) the opera- 
tion must be done in a pipelined way. There are 
two phases to be distinguished in the evaluation 
of equation (7): 
1. Compression of the partial operands in a sum 
and a carry vector. 
2. Addition of 1. and the shifted partial product 
generated in the previous time step. 
The phases 1 and 2 can be done in a pipelined way.
For large values of N internal pipelining of phase
1 may be necessary. The approach is to use a carry
save cellular array of dimension 2N.n. In this case
the delay is linearly dependent on the number of
operand pairs N. An example of a carry save adder
array is shown in figure 1. The cells of the array
consist of full adders. The array produces a sum
and a carry vector. Note that in figure 1 the top
row can be deleted for positive operands. The b
operand can be fed directly into the full
adders of row 2. The elements are included for regularity
since additional logic must be included
in each cell.
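
Phase 1 can be illustrated in software. The sketch below assumes non-negative integer operands and models each row of full adders as a bitwise 3-to-2 reduction. It is not a model of the cell array in figure 1, and the names are illustrative.

```python
def carry_save_add(x, y, z):
    """One row of full adders: three operands in, a sum and a carry vector out."""
    sum_vector = x ^ y ^ z
    carry_vector = ((x & y) | (x & z) | (y & z)) << 1
    return sum_vector, carry_vector

def compress(operands):
    """Reduce any number of operands to a (sum, carry) pair without carry propagation."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = carry_save_add(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, c]
    while len(ops) < 2:
        ops.append(0)
    return ops[0], ops[1]

s, c = compress([5, 9, 12, 3, 7])
print(s + c)  # a single carry propagating addition recovers the total
```

Each reduction removes one operand, so the number of rows grows linearly with the number of operands, which matches the linear delay noted above.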
For the addition in equation (7) the operands must 
be in two’s complement form. If the operands are 
sign-magnitude numbers they must be converted to 
two’s complement numbers. When both operands in the 
multiplication process are sign-magnitude numbers 
and have the same sign they are already in the 
correct form so no conversion is needed. When the 
operands have opposite signs the weight of the 
product pair will be negative. The complementing
of both partial operands can be done by complementing
the individual bits of the operands and adding
a 1 to the operand. The addition of a 1 to the
operand is the same as putting a 1 on the $c_{in}$ inputs
in figure 1 if the corresponding operand is
negative. So the sign bits in the sign-magnitude
format are not involved in the computation of the
partial operands according to equation (7). The
combination of the sign bits of the operand pair
only determines the positive or negative weight of
the total operand pair.
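
The complementing rule above (invert the bits, then inject a 1 on the carry input) can be sketched as follows. The fixed `width` and the helper names are assumptions made for illustration only.

```python
def negate_via_complement(x, width):
    """Negate a non-negative operand: invert every bit, then add 1 via the carry input."""
    mask = (1 << width) - 1
    inverted = ~x & mask
    carry_in = 1                      # plays the role of the c_in input
    return (inverted + carry_in) & mask

def to_signed(value, width):
    """Interpret a width-bit pattern as a two's complement number."""
    return value - (1 << width) if value >> (width - 1) else value

print(to_signed(negate_via_complement(10, 8), 8))
```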

When both operands are two's complement numbers
the operands are already in the desired form. Only
the sign bit must be interpreted as a negative
weighting factor. Here the sign bits of the operands
are explicitly involved in the computation of
the operands according to equation (7).

For the sign determination of the result the method 
of Agrawal et al. [AGRA78] is followed. This method 
prevents the extension of the operands by inverting 
the sign bits and adding a fixed correction factor 
C to the sum (C inputs in figure 1). 

The addition of the right-hand-side terms of eq. (7)
besides $2^{-s} \cdot P(k-1)$ can be done in the way described
above. The result has ent[log_2 2N] + n bits. Each
time step the partial product generated in the previous
time step must be added to the sum of the
partial operands. This can be accomplished by
(5,3) counters. This is the same method as in
[CHEN79], [SIPa82]. These counters are shown in figure 1.
The outputs of the counters (S,C(1),C(2))
are wired according to the algorithm (MSB or LSB).
The counters contain three storage cells to save
the partial product. Figure 2 shows the wiring for
the MSB and for the LSB algorithm. It can be seen
that in the LSB wiring of the (5,3) counters only
nearest neighbour interconnections occur.
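
As a behavioural sketch only, a (5,3) counter can be modelled as a unit that counts the ones among five input bits of equal weight and emits the count as three bits. The mapping of the outputs to weights 1, 2 and 4 is an assumption here, and the storage cells and wiring of figure 2 are not modelled.

```python
def counter_5_3(bits):
    """Count the ones among five equally weighted bits; return (S, C1, C2)."""
    if len(bits) != 5 or any(b not in (0, 1) for b in bits):
        raise ValueError("expected five bits")
    total = sum(bits)                       # 0..5 fits in three bits
    return total & 1, (total >> 1) & 1, (total >> 2) & 1

print(counter_5_3([1, 0, 1, 1, 0]))
```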
The carry save adder array description in figure 1
does not include the control mechanism needed to
generate the partial operands. The aim must be to
design a cell which is simple in structure and has
a minimum of interconnections to the outside world.
From eq. (7) it can be seen that each time step a
new bit is appended to the operands. Whether or
not the operand participates in the addition of
that time step is dependent on the new bit of the
corresponding other operand. The minimum needed
is a full adder, one storage cell to hold the
operand and a few gates. Figure 3 shows the basic
cell layout. The operand line (here $B_j$ has been
chosen) is a line along all cells where the operand
is to be stored (rowwise). They successively
activate each column. The control signal q(k+1)
loads each new bit of the operands in the next
column. The control line $g=a_j(k)$ is the one step
delayed value of one bit of the corresponding
operand $A_j(k)$ and qualifies the operand $B_j(k)$
according to equation (7). The treatment of $A_j(k-1)$
is a little different because of the difference
in the k-index. Therefore the new bit $a_j(k)$ must be
suppressed. This is indicated by the signal q(k)
which is the one step delayed signal q(k+1). The
complement signal determines whether the operand
is positive (com=0 and $c_{in}$=0) or negative (com=1
and $c_{in}$=1).
After processing the last bit of the operands the 
result must be available as soon as possible. In 
the MSB first case the result is stored in the 
(5,3) counters. A fast carry propagating adder is 
necessary to determine the result. The propagation 
delay determines the delay d of the operation. 
The result is then in two’s complement form. If a 
result in sign-magnitude is required an extra 
complementing step is needed. This can be done 
while the sign bit is placed on the output. In
the LSB first case there is no fast carry
propagating adder necessary if the result is required
in two's complement form because the MSB
half of the product can be calculated during the
output transfer of the result. There is, however,
in overlapped computation an extra (5,3) counter
necessary because the accumulating (5,3)
counter is needed in the next inner product evaluation.
If the result has to be in sign-magnitude
form a fast carry propagating adder is necessary
to determine the sign bit.
Figure 4 shows a numerical example of the LSB 
first two’s complement algorithm. 
Another property of the multi-operand processor 
is the improved dynamic accuracy. If the inner 
product has a mixture of positive and negative 
operands the multiplication of product pair j may 
overflow in the sense that the result contains 
more than n bits without causing an overflow of
the final result. (This is shown in figure 4). 
[Figure 2. (5,3) counter wiring, for the MSB first and the LSB first algorithm.]

[Figure 3. Basic cell layout, with signals b_j(k+1), g = a_j(k) and com.]

[Figure 4. Numerical example: A * B + C * D with A = 101110 (-18), B = 001010 (10), C = 001110 (14), D = 001011 (11); result 100110 (-26).]
Abstract -- Digit online arithmetic has a
great deal of potential for the speedup of computation.
Digit online algorithms have the property
that in order to generate the j-th most significant
digit of the result it is sufficient to
have the first j+k most significant digits of the
operands. The difference k is a small predefined
constant corresponding to an online delay. This
paper presents a software package that simulates
the operation of computational systems which use
digit online arithmetic. The simulator provides 
the ability to investigate the advantages and 
disadvantages of using digit online arithmetic 
for various applications. 


Introduction 


In recent years a good deal of research has 
been directed towards digit online algorithms and 
their corresponding architectures [2],[3],[4],[5], 
[6], [7]. These algorithms may be realized by 
special arithmetic systems, one of which uses digit 
online pipelines. Online algorithms have the 
property that in order to generate the j-th most 
significant digit of the result it is sufficient 
to have the first j+k most significant digits of 
the operands. The difference k is a small prede- 
fined constant. Thus, after a startup delay of k 
steps an online algorithm will generate one digit 
of the result at each step. 

The advantage of using digit online pipelines 
is demonstrated in systems that involve the 
chaining together of many pipelines. Machines 
such as the CRAY-1 use the technique of chaining 
on words to achieve the fastest processing speed. 
This involves connecting the output of one pipe- 
line to the input of another. If two conventional 
pipelines are chained together in this way the 
second pipeline cannot begin processing until the 
first pipeline has produced its first result. On- 
line pipelines are not strung end to end but side 
to side to achieve chaining on digits, creating an 
online pipeline network [3]. The attractiveness 
of online pipeline networks can be seen most dramatically
in the computation of recursive equations.
In such a network the computation of $f_{i+1}$
may begin as soon as the first digit of $f_i$ becomes
available. The improvement in processing speed
over conventional pipeline networks can be dramatic.

This paper presents a software simulation for 
operations and expressions evaluated using digit 
online algorithms first presented in [4]. The 
simulator implements floating point addition (sub- 
traction), multiplication, division, and square 
root in a fully digit online manner. The simula- 
tor was designed to provide the ability to create 
and analyze a wide range of digit online pipeline 
networks. In this way it helps to determine the
advantages and disadvantages of using online
arithmetic to solve various problems.


Design of the Simulator


The simulator package consists of about two
thousand lines of PASCAL code on a VAX 11/780
system running UCB VMUNIX. The programs were designed
to be highly interactive and easy to use.
Consequently they can be used to demonstrate the
concepts of digit online arithmetic to those who
are unfamiliar with this subject. The simulator
can be run so that the operand digits are requested
from the user as they are needed and the
result digits are displayed as they become available.
In this way the operation of the algorithms
may be directly observed.

The programs simulate the algorithms RADD,
RMUL, RDIV, and RSQR presented in [4]. The algorithm
NORM has been implemented to help normalize
operands. The digit online algorithms were designed
in such a way that they may be implemented
with a limited number of hardware primitives requiring
a small number of gate delays [4]. The
four primitive operations implemented in the simulator
are:


1) Selection of a fixed point value based on
   a two digit value. This operation may be
   performed using 2 gate delays in a simple
   table lookup fashion.

2) Addition of two fixed point values. This
   operation is performed by a signed digit
   addition that requires 4 gate delays.

3) Multiplication of a fixed point value by a
   single digit value. This operation takes
   6 gate delays.

4) Shifting a fixed point value by a constant
   value. In hardware this operation could be
   accomplished by simply offsetting the interconnections
   between corresponding components
   in 0 gate delays.

All of the more complex operations in the floating 
point algorithms are manipulated so that they can 
be expressed in terms of these four simple func- 
tions. Each iteration of an algorithm is simu- 
lated by a series of calls to these primitive 
functions producing the desired result. In this 
way the simulation carries out the operations in 
the same way that they would be performed in hardware
by a digit online pipeline. The simulator
also keeps track of the number of gate delays that 
have elapsed at the end of each step. This will 
give the user an idea about the processing speed 
of the network. 
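
The gate delay bookkeeping described above can be sketched as follows. The four costs are the ones listed for the primitives. The class name and the sequence of primitives charged for one iteration are purely hypothetical, since the composition of each algorithm step is given in [4] and not here.

```python
# Gate delays of the four primitives, as listed above.
GATE_DELAYS = {"select": 2, "add": 4, "multiply": 6, "shift": 0}

class DelayCounter:
    """Accumulates the gate delays of the primitives called so far."""

    def __init__(self):
        self.elapsed = 0

    def charge(self, primitive):
        self.elapsed += GATE_DELAYS[primitive]
        return self.elapsed

counter = DelayCounter()
# Hypothetical iteration: one table lookup, one digit multiply, one add, one shift.
for primitive in ("select", "multiply", "add", "shift"):
    counter.charge(primitive)
print(counter.elapsed)
```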

The algorithms RADD, RMUL, RDIV, and RSQR 
operate by taking one digit of each operand per 
iteration and generating approximations to the 


characteristic and mantissa of the result in the 
fashion of the traditional continued sums/products 
algorithms [1]. The algorithms are online with 
respect to their inputs but since the result is 
not available until the final iteration they are 
not online with respect to their output. Fortu- 
nately, there are algorithms that when given the 
approximations to the result out of these algo- 
rithms, can generate the result in an online man- 
ner. Two of these functions were programmed into 
the simulation. 

The first such function is the discretization 
algorithm DISC [4]. When supplied with the ap- 
proximations to some result z, DISC will generate 
z in a digit online manner. By using DISC, the 
operations of addition (subtraction) and multipli- 
cation will be performed with an online delay of 
one. Both division and square root will have an 
online delay of three. A problem with DISC is 
that it tends to generate unnormalized results. 
The algorithm RADD, for example, always preshifts 
the mantissa of the result one position to the 
right to avoid mantissa overflow, so usually the 
result will be unnormalized. Unnormalized results 
increase the error in a system [5]. They also 
cause problems when these results are used as the 
input to a process that does not accept unnormal- 
ized operands such as RDIV. 

Another digit generating algorithm, MOSN, may 
be used to decrease the probability of unnormal- 
ized results. MOSN is constructed so that the 
last characteristic digit of the result is not 
computed until the first approximation to the man- 
tissa of the result becomes available. Using the 
first mantissa approximation, MOSN determines how 
many places the mantissa can be shifted to the 
left without causing overflow. In this way, MOSN 
is able to normalize many results that would 
otherwise be unnormalized. The online delay of 
algorithms using MOSN will be one greater than the 
delay out of DISC. Table 1 shows a comparison of 
a divide operation using DISC and MOSN. As may be 
seen from this example the result when using MOSN 
is closer to the correct result. The penalty for 
using MOSN is an additional step. The simulator 
uses redundant base 8 arithmetic. 


Table 1 - A Comparison of DISC and MOSN


Result of RDIV and DISC.

PROCESS NUMBER 1: Began at time = 0.
02:2216 (1.803125E+01) is the dividend b.
10:4534 (9.617408E+06) is the divisor c.
13:0414 (1.877546E-06) is the quotient a.
Ended at time = 288.


Result of RDIV and MOSN.

PROCESS NUMBER 1: Began at time = 0.
02:2216 (1.803125E+01) is the dividend b.
10:4534 (9.617408E+06) is the divisor c.
12:4143 (1.874752E-06) is the quotient a.
Ended at time = 320.


The actual quotient is 1.874855E-06.


Experimental Results 


One of the major accomplishments of the simulator
to date has been to clearly demonstrate the
areas of concern that exist when using digit on- 
line arithmetic. By simulating various online 
pipeline networks, the magnitude of these concerns 
was observed. The simulator also served as a use- 
ful tool for finding solutions to some of these 
concerns. The primary concern is generation of 
unnormalized results. When unnormalized values 
occur in a network, gradual mantissa underflow 
may occur [5] since an unnormalized value may 

not be as precise as it could be. By shifting 

the mantissa left to remove leading zeroes, more 
digits of the result may be added increasing the 
precision. An even bigger problem with unnormal- 
ized values in a network is that they may not be 
used as the divisor in RDIV or as the radicand in 
RSQR. 

The algorithms RADD, RMUL, and RDIV all have 
the possibility of generating unnormalized re- 
sults. The algorithm RSQR however, is an inter- 
esting exception. RSQR will operate correctly 
provided that the mantissa is normalized and pos- 
itive. If the radicand satisfies these require- 
ments then the result of RSQR will also be normal- 
ized. When combined with DISC, RSQR will compute 
the square root in an online manner with an online 
delay of three. This result will be normalized, 
so it may be used as the radicand of another 
square root operation or the divisor of a division 
operation. Thus, it is possible to construct an 
online pipeline network to compute the n-th root 
of x, where n is a power of 2, provided that x is 
positive and normalized. This network will have 
an overall delay of 4(log_2 n)-1 steps. Table 2


shows the simulation of a network to compute the 
eighth root of a number. 

Table 2 - A Network to Compute Eighth Root


Enter the expression(s) (Type to stop.):
13 as=es=' bs=" a.

Enter the 2 characteristic digits of a. 1 4
Enter the 6 mantissa digits of a. 2 7 6 5 7 0

PROCESS NUMBER 1: Began at time = 0.
14:276570 (2.559993E+10) is the radicand a.
12:510400 (1.600000E+05) is the square root b.
Ended at time = 352.

PROCESS NUMBER 2: Began at time = 128.
12:510400 (1.600000E+05) is the radicand b.
03:620000 (4.000000E+02) is the square root c.
Ended at time = 480.

PROCESS NUMBER 3: Began at time = 256.
03:620000 (4.000000E+02) is the radicand c.
02:240000 (2.000000E+01) is the square root d.
Ended at time = 608.


Unfortunately, the algorithms RADD, RMUL, and
RDIV do not have the desirable property that they
always generate normalized results. In fact, when
combined with DISC, these algorithms will usually
generate unnormalized numbers. MOSN only decreases
the probability of unnormalized results. But no
algorithm which can truly be said to be digit online
can guarantee that the result will be normalized
for all possible values of the operands, especially
in the case of cancellation during subtraction.
One way to guarantee that the result
will be normalized is to put restrictions on the
operands so that cases such as the above do not
occur.
Increasing the online delay of certain oper- 
ations is another way of guaranteeing that a value 
in an online pipeline network can be normalized. 
The algorithm NORM has been incorporated in the 
simulator to provide the user with the ability to 
specify the online delay of a process. NORM re- 
ceives the inputs in an online fashion and gener- 
ates a result, with the same value as the operand, 
after an online delay of one. If the first digit 
of the operand is a zero, NORM will be able to 
shift the mantissa one position to the left and 
adjust the characteristic accordingly. NORM is
also written so that by recursively applying NORM
to an unnormalized number enough times, that number
will eventually be normalized. Thus, an unnormalized
mantissa such as 0.100...001 will become 0.100...017
after one application of NORM, and 0.077...7 after
n applications of NORM.
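
The leading-zero case of NORM can be sketched as follows. This is a simplified model, assuming ordinary (non-redundant) base 8 digits and a value of mantissa times 8 to the power of the characteristic. It does not reproduce the redundant digit recoding of the simulator's NORM, and the function names are illustrative.

```python
def norm_once(characteristic, mantissa):
    """One application: shift left by one digit if the leading digit is zero."""
    if mantissa[0] == 0 and any(mantissa):
        return characteristic - 1, mantissa[1:] + [0]
    return characteristic, mantissa

def normalize(characteristic, mantissa):
    """Apply norm_once repeatedly until the leading digit is non-zero."""
    while mantissa[0] == 0 and any(mantissa):
        characteristic, mantissa = norm_once(characteristic, mantissa)
    return characteristic, mantissa

def value(characteristic, mantissa, base=8):
    """Numeric value of 0.d1 d2 ... times base**characteristic."""
    return sum(d * base ** (characteristic - i - 1) for i, d in enumerate(mantissa))

print(normalize(3, [0, 0, 5, 1]))
```

Each application costs one extra step of online delay, which is the tradeoff the text describes.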


The simulator package has been used to simu- 
late many practical online pipeline networks. One 
such network is for the LU decomposition of an 
n by n tridiagonal matrix using the recurrences: 


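$$d_0 = b_0$$

$$d_i = b_i - c_{i-1} \cdot (a_i/d_{i-1}) \quad \text{for } 1 \le i \le n-1$$

The computation $l_i = a_i/d_{i-1}$ is treated as a desirable
by-product. Table 3 shows the simulation
of one iteration of the LU decomposition for

$$\begin{bmatrix} 2 & 3 & 0 & 0 \\ 6 & 20 & 7 & 0 \\ 0 & 22 & 5 & 5 \\ 0 & 0 & 36 & 4 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ 0 & 2 & 1 & 0 \\ 0 & 0 & -4 & 1 \end{bmatrix} \begin{bmatrix} 2 & 3 & 0 & 0 \\ 0 & 11 & 7 & 0 \\ 0 & 0 & -9 & 5 \\ 0 & 0 & 0 & 24 \end{bmatrix}$$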


This example shows the advantage of using digit
online arithmetic. As can be seen from the starting
and ending times of the processes, the computations
of $l_1$, $d_1$, and $l_2$ are performed in parallel.

One problem with this application is that the
result $d_{i-1}$ may be unnormalized and therefore the
computation of $l_i = a_i/d_{i-1}$ may be incorrect. One
solution to this problem would be to restrict the
values of $a_i$, $b_i$, and $c_i$ in such a way that $d_{i-1}$
will be normalized. These restrictions would obviously
reduce the usefulness of the system. Another
solution is to apply the algorithm NORM to
the divisor.
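
The tridiagonal LU recurrences can be checked against the 4 by 4 example with a short sequential sketch. It assumes b holds the diagonal, a the subdiagonal (with a[0] unused) and c the superdiagonal, and it uses exact fractions in place of digit online arithmetic. The function name is illustrative.

```python
from fractions import Fraction

def tridiagonal_lu(a, b, c):
    """Return (l, d): l[i] = a[i]/d[i-1], d[0] = b[0], d[i] = b[i] - c[i-1]*l[i]."""
    n = len(b)
    d = [Fraction(b[0])]
    l = [None]                      # l[0] is undefined, kept for index alignment
    for i in range(1, n):
        l_i = Fraction(a[i]) / d[i - 1]
        l.append(l_i)
        d.append(b[i] - c[i - 1] * l_i)
    return l, d

a = [None, 6, 22, 36]   # subdiagonal of the example matrix
b = [2, 20, 5, 4]       # diagonal
c = [3, 7, 5]           # superdiagonal
l, d = tridiagonal_lu(a, b, c)
print([int(x) for x in l[1:]], [int(x) for x in d])
```

The values of l and d are the subdiagonal of L and the diagonal of U in the factorization shown for the example.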


Table 3 - LU Decomposition


PROCESS NUMBER 1: Began at time = 0.
01:600000 (6.000000E+00) is the dividend a[1].
01:200000 (2.000000E+00) is the divisor d[0].
01:300000 (3.000000E+00) is the quotient l[1].
Ended at time = 384.

PROCESS NUMBER 2: Began at time = 160.
01:300000 (3.000000E+00) is the multiplicand c[0].
01:300000 (3.000000E+00) is the multiplier l[1].
02:110000 (9.000000E+00) is the product.
Ended at time = 480.

PROCESS NUMBER 3: Began at time = 256.
02:240000 (2.000000E+01) is the addend b[1].
02:110000 (-9.000000E+00) is the augend.
02:130000 (1.100000E+01) is the sum d[1].
Ended at time = 576.

PROCESS NUMBER 4: Began at time = 352.
02:260000 (2.200000E+01) is the dividend a[2].
02:130000 (1.100000E+01) is the divisor d[1].
01:200000 (2.000000E+00) is the quotient l[2].
Ended at time = 736.


Conclusions


This paper has presented a highly functional
simulator for digit online algorithms. Since it
was designed to perform in the same way that a
straightforward hardware implementation of the algorithms
would operate, the simulations of online
pipeline networks will experience the same problems
that the actual networks would encounter. The
possibility of unnormalized divisors occurring in
a network is one such problem. The need to avoid
unnormalized values leads to tradeoffs in processing
speed, restrictions on inputs, and the precision
of results. The simulator allows the user
to investigate how these tradeoffs come into play
for various applications.
Abstract -- Speech understanding is a complex
task which requires extensive computation. To increase 
the processing speed, a speech understanding system can 
be decomposed into tasks which can be performed by a 
series of distributed processing sub-systems. An archi- 
tecture to perform acoustic processing is described in 
this paper. The parallel architecture for acoustic pro- 
cessing calculates characteristic parameters which 
describe the input speech signal. The types of opera- 
tions performed include digital filtering, FFTs, linear 
predictive coding, autocorrelation calculations, and pitch 
analysis. The architecture is a multiple-SIMD system 
using the MC68000 microprocessor as the basic process- 
ing element. Using realistic assumptions from existing 
speech understanding systems, the attributes of the 
parallel system to perform acoustic processing for real- 
time speech understanding are derived. In particular, 
details about the organization and the number of pro- 
cessors in each of the component SIMD sub-systems are 
obtained. Interconnection network requirements are 
determined from the SIMD algorithms used. Timing 
analysis is performed. 


I. Introduction 


A speech understanding system accepts spoken 
speech input, derives a conceptual understanding of the 
input, and produces a response. In a typical system, a 
number of knowledge source components interact to 
resolve the errors and ambiguity inherent in human 
speech. These knowledge sources perform operations 
such as acoustic parameterization, phonetic interpreta- 
tion, lexical processing, syntactic analysis, semantic 
interpretation, and response generation. Existing speech 
understanding systems that have been developed are 
described in [6], [8], and [16]. 


The extensive computation required precludes real- 
time speech understanding on a conventional serial com- 
puter. To improve the processing speed, the different 
knowledge sources can act in parallel (possibly on 
different portions of an utterance), and in addition, com- 
putational tasks within each knowledge source can be 
performed in parallel. Advances in technology have 
made it realistic to consider large-scale parallel process- 
ing systems [e.g., 2, 5, 13]. By designing multiprocessor 
knowledge sources, real-time speech understanding (with 
a constant delay) should be achievable. The next sec- 
tion briefly outlines a general configuration for a mul- 
tiprocessor system for speech understanding. In the fol- 
lowing sections, a detailed description and analysis of a 
parallel architecture for acoustic processing is described. 


II. A Parallel Architecture For Speech Understanding


An architecture proposed to handle the speech 
understanding task consists of a distributed series of 
computation stations [3, 4]. Each computation station
corresponds roughly to a speech understanding 
knowledge source. This distributed parallel architecture 
is diagramed in Fig. 1. The interconnection of 
knowledge sources forms a linear pipeline in which each 
stage is a complete multiprocessor sub-system. 


A typical computation station consists of an input 
memory buffer (MB), an output MB, control units 
(CUs), and processing elements (PEs). The organization
of the PEs within each computation station is
selected to exploit whatever parallelism is inherent in 
the specific task being performed by that station. The 
processing time for each station is a function of the 
computational complexity of the tasks to be performed 
and the amount and arrival rate of input data. Assum- 
ing a maximum input rate, the processing speed require- 
ments can be met by employing parallelism within the 
task algorithms and also among the tasks to be per- 
formed. Minimum processing time for the computation 
station will be ensured when the data in the input MB is
processed as soon as it is available. When a subset of 
PEs has finished a processing task and stored its result 
in the output MB, it is available to be assigned another 
task by the computation station’s primary control unit. 


Each computation station is specialized to meet 
performance (speed) requirements of the overall system. 
Processing proceeds asynchronously with respect to 
adjacent computation stations. When the processing 
time for each station is approximately equal, then no 
bottlenecks occur and data flow through the system will 
be continuous, providing real-time performance (with a 
constant delay). Because the parallelism within each 
computation station permits processing of all probable 
utterance hypotheses simultaneously, there is no need to 
backtrack once any particular hypothesis has been 
determined improbable. Thus, extensive parallelism is
being used at each stage of the speech understanding 
process in order to simplify the interaction among vari- 
ous knowledge sources. 


III. Acoustic Processing


Acoustic processing is the task of transforming 
periodically sampled digitized speech into characteristic 
time and frequency domain parameters. Acoustic pro- 
cessing is described in [15] and [24]. 

The number and type of parameters used by the 
major speech understanding systems vary with each sys- 
tem. The complete set of parameters calculated, called 
characteristic parameters, represents a segment of 
speech data called a frame. A frame can range from 5 
to 20 ms in length and corresponds to a uniform section 
of an utterance. A frame length of 12.8 ms is used by 
the architecture described here. For each frame, 37 
characteristic parameters are calculated. In order to 
achieve real-time performance, the 37 parameters must 
be calculated in at most 12.8 ms. 


The speech data is sampled at 20 kHz. Therefore,
a 12.8 ms speech data frame contains 256 data samples.
This 256 point data set is called the short data set. 
Most of the acoustic parameters are calculated from 
these data points. Other parameters, especially those 
relating to the pitch of the speaker’s voice, require a 
longer segment of speech data containing several vocal 
cord oscillations. For these parameters, a 51.2 ms seg- 
ment of speech data, consisting of 1024 digitized sample 
points, is used. This data set, called the long data set, 
includes the current 12.8 ms frame plus the preceding
38.4 ms of speech data. Both data sets are completed 
and available for processing simultaneously. The 
parameters calculated from both data sets characterize 
the interval of speech of the short data set. 
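
The data set sizes follow directly from the sampling rate. A small check, with illustrative names:

```python
SAMPLE_RATE_HZ = 20_000

def samples_in(duration_ms):
    """Number of samples in a segment of the given length at 20 kHz."""
    return round(SAMPLE_RATE_HZ * duration_ms / 1000)

short_set = samples_in(12.8)                     # current frame
long_set = samples_in(12.8) + samples_in(38.4)   # current frame plus preceding data
print(short_set, long_set)
```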


The 37 characteristic parameters are listed below. 


A1 - A24  Linear predictive coding (LPC) coefficients. 
          The predictor coefficients uniquely specify the 
          transfer function of the vocal tract. 

AC        The normalized autocorrelation coefficient at 
          unit sample delay. This is a rough measure of 
          the uniformity of the data within the frame. 

EH        Signal energy within a High frequency band 
          (5000 to 10000 Hz). 

EL        Signal energy within a Low frequency band 
          (625 to 2500 Hz). 

EM        Signal energy within a Mid frequency band 
          (2500 to 5000 Hz). 

ET        Signal energy within the Total frequency range 
          (0 to 10000 Hz). 

EVL       Signal energy within a Very Low frequency 
          band (0 to 625 Hz). The energy within the 
          speech signal characterizes the overall vocal 
          tract configuration. 

ERN       LPC normalized minimum error. This parameter 
          reflects the accuracy of the linear prediction 
          model for describing the speech frame. 

F0        Fundamental frequency (pitch). The fundamental 
          frequency indicates the oscillation rate of the 
          speaker's vocal cords. 

F1        First formant (resonant) frequency. 

F2        Second formant frequency. 

F3        Third formant frequency. The values of the 
          first three formant frequencies are useful in the 
          characterization of vowels and sonorants. 

RMS       Root mean square energy of the preemphasized 
          speech signal. 

ZC        Zero crossing density. The zero crossings can be 
          used to separate fricative from non-fricative 
          speech sounds. 


The algorithms required to obtain these parameters will 
be discussed in section V. 


IV. An Architecture For Acoustic Processing 
The SIMD Architecture 


The architecture to perform acoustic processing 
within the speech understanding system is called the 
Acoustic Processing Computation Station. This is the 
second stage of the speech understanding system 
diagramed in Fig. 1. The Acoustic Processing Compu- 
tation Station is diagramed in Fig. 2. It consists of a 
primary CU which coordinates processor activity, 4 
secondary CUs, an input MB, an output MB, 512 PEs, 
and a multistage cube interconnection network. 

Fig. 1. A distributed speech understanding system. 

Fig. 2. The Acoustic Processing Computation Station. 

The computation station receives its data from the Input 
Processing Computation Station. It calculates the 
characteristic time and frequency domain parameters for 
frames of the input data and stores the results in the 
output MB. These parameters are then accessed by the 
Segmentation Computation Station. The components of 
the computation station form a multiple-SIMD 
(MSIMD) system designed to exploit the parallelism 
inherent in acoustic processing tasks. 


The Processing Element 


The MSIMD acoustic processing architecture uses 
512 PEs. The PE model used is based on that 
presented in [7]. All control and arithmetic operations 
within the PE are performed by an MC68000 micropro- 
cessor. A memory management unit will arbitrate all 
read and write operations to the microprocessor. When 
the CU broadcasts an instruction to the PE, this 
instruction is stored within an internal instruction 
memory and subsequently read by the MC68000. The 
CU can enable or disable the PE by utilizing masking 
instructions. The CU can also read various condition 
codes from the PE. The internal memory is used for 
data storage and is used only by the PE. 

The MC68000 microprocessor is a powerful 16-bit 
device with 56 instruction types, 14 addressing modes, 
and eighteen 32-bit internal data and address registers 
[11]. Processor timing calculations were made with the 
microprocessor running Motorola’s 68343 fast floating 
point software [12]. The execution times are calculated 
for the microprocessor running with a 12.5 MHz clock 
frequency. Processing times for arithmetic operations 
are given in Table 1. 


The Interconnection Network 


Each PE is connected to all of the other PEs in the 
computation station by a 16-bit multistage cube inter- 
connection network with independent box control [17]. 
The cube network can be partitioned into independent 
sub-networks of varying power of two sizes, allowing 
subsets of the PEs to act as independent SIMD 
machines. Routing through the network is established 
with routing tags generated by each PE. The multis- 
tage cube interconnection network was chosen because 
of its extremely high efficiency when performing many 
parallel processing algorithms. Network transfer times 
used are based upon the simulation studies in [1]. Net- 
work transfer times for different data types are given in 
Table 2. These times include the times for routing tag 
generation and configuration of the network based on 
the routing tag specification. 


V. Algorithms 


The calculation of the 37 characteristic acoustic 
parameters requires many signal processing operations. 
To achieve the necessary processing speed, each signal 
processing task was divided into parallel sub-tasks or 
algorithms that can be run on an SIMD machine. Nine- 
teen parallel signal processing algorithms were used. 
Each of the SIMD algorithms is such that it can run on 
machines of different sizes, with execution time a function 
of the machine size. The processing time for the 
computation station can be adjusted by varying the 
number of PEs executing each SIMD sub-task. 


Table 1. Processing times for the MC68000 with a 12.5 
MHz clock. 

OPERATION                    TIME (µs) 
Integer (16 bit) 
  Add/Subtract                  0.4 
  Load                          0.4 
  Store                         0.4 
Floating Point (32 bit) 
  Add/Subtract                 14.1 
  Divide                       48.6 
  Multiply                     28.2 
  Square Root                 124.2 
  Load                          0.8 
  Store                         0.8 
  Compare                       1.6 
  Absolute Value                0.8 
  Negate                        1.6 
Complex (2 * 32 bit) 
  Add/Subtract                 28.2 
  Multiply                    141.0 


Table 2. Network transfer times for each data type. 
These times include time to generate routing 
tags and set the network. 

DATA TYPE                    TIME (µs) 
Integer (16 bits)               4.5 
Floating Point (32 bits)        6.4 
Complex (2 * 32 bits)          10.1 


The nineteen parallel algorithms used by the 
Acoustic Processing Computation Station are listed 
below. The first number after each item in the list 
designates a reference to the calculation performed seri- 
ally. Subsequent numbers indicate references to a paral- 
lel algorithm. 


Autocorrelation Calculation [15, 19, 22] 
Center Clipped Signal Construction [15, 20] 
Data Zero Padding 
Digital Inverse Filter [15, -] 
Energy Band Calculation [24, -] 
FFT [14, 21] 
    Radix 2 Decimation-in-Frequency (DIF) 
    Radix 2 Decimation-in-Time (DIT) 
Formant Frequency Analysis [9, -] 
Hamming Window [15, -] 
LPC Coefficients Calculation [10, 20] 
LPC Minimum Error Calculation [10, -] 
Maximum Calculation [-, 23] 
Minima Calculation [-, 23] 
Normalized Autocorrelation Calculation [18, -] 
Partial Autocorrelation Calculation [15, 19] 
Pitch Extraction [15, 20] 
Preemphasis [15, -] 
RMS Energy Calculation [24, -] 
Squared Magnitude Operation [15, -] 
Zero Crossing Calculation [24, -] 


Several of the parameters computed by the Acoustic 
Processing Computation Station are obtained by 
combining a number of these algorithms. The 256 point 
Autocorrelation Calculation used to obtain the LPC 
coefficients is computed by combining four parallel 
algorithms: Data Zero Padding, Radix 2 DIF FFT, Squared 
Magnitude Calculation, and the Radix 2 DIT FFT. 
The Digital Inverse Filter used in obtaining the formant 
frequencies is composed of four parallel operations: Data 
Zero Padding, a Radix 2 DIT FFT, a Squared Magnitude 
Calculation, and a DBR → WRP Network Data Transfer. 
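The four-step autocorrelation (zero padding, forward FFT, squared magnitude, second FFT) can be sketched serially as follows. This is only an illustration of the arithmetic, assuming a power-of-two frame length; it uses one recursive radix-2 routine for both transforms and does not model the DIF/DIT pairing, the PE partitioning, or the data orderings of the parallel version. The function names are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def autocorrelation_fft(frame):
    """Autocorrelation via zero padding, FFT, squared magnitude, inverse FFT."""
    n = len(frame)
    padded = list(frame) + [0.0] * n           # Data Zero Padding
    spectrum = fft(padded)                     # forward FFT
    power = [abs(c) ** 2 for c in spectrum]    # Squared Magnitude
    r = fft(power, inverse=True)               # second FFT
    return [v.real / (2 * n) for v in r[:n]]   # normalize, keep lags 0..n-1


def autocorrelation_direct(frame):
    """Reference O(n^2) definition, used only to check the FFT route."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(n)]
```

Zero padding to twice the frame length is what turns the circular correlation produced by the transform pair into the linear autocorrelation needed for LPC analysis.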


Interaction points occur at the end of one algorithm 
and the beginning of another, when the parallel algo- 
rithms must interact to synchronize and exchange data. 
The specific problems which must be addressed at the 
interaction points are sub-system size and data alloca- 
tion. The number of PEs used to execute each algo- 
rithm is determined by the real-time constraints. 
Therefore, the number of PEs operating on a given data 
set may change. In addition, the way in which data is 
assigned to the PEs may differ from one algorithm to 
the next, or the results from one algorithm may not be 
allocated in the pattern needed by the next algorithm. 
To simplify algorithm interaction, four distinct data 
orderings are defined for the algorithms used. 

For D PEs and data items d(0), d(1), ..., d(D-1): 

    Single Sequential Order (SSQ): 
        PE p contains d(p) 

For D/2 PEs and data items d(0), d(1), ..., d(D-1): 

    Dual Bit Reversed (DBR): 
        PE p contains d(br(2 * p)) 
                  and d(br((2 * p) + 1)) 
        where br(x) = bit reverse of x 

    Dual Sequential Order (DSQ): 
        PE p contains d(2 * p) 
                  and d((2 * p) + 1) 

    Wrap Around Order (WRP): 
        PE p contains d(p) 
                  and d(p + D/2) 

Each of the above three data orderings can be generalized 
to D/2^i PEs for D data items, where i is an integer. 
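The four orderings can be illustrated with a small sketch that returns, for each PE, the list of data items it holds. It assumes D is a power of two and that the bit reversal is taken over log2(D) bits; the function names are illustrative only.

```python
def bit_reverse(x, bits):
    """Reverse the low `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def ssq(data):
    """Single Sequential Order: D PEs, PE p holds d(p)."""
    return [[d] for d in data]


def dbr(data):
    """Dual Bit Reversed: D/2 PEs, PE p holds d(br(2p)) and d(br(2p+1))."""
    n = len(data)
    bits = n.bit_length() - 1  # log2(D), assuming D is a power of two
    return [[data[bit_reverse(2 * p, bits)], data[bit_reverse(2 * p + 1, bits)]]
            for p in range(n // 2)]


def dsq(data):
    """Dual Sequential Order: D/2 PEs, PE p holds d(2p) and d(2p+1)."""
    return [[data[2 * p], data[2 * p + 1]] for p in range(len(data) // 2)]


def wrp(data):
    """Wrap Around Order: D/2 PEs, PE p holds d(p) and d(p + D/2)."""
    n = len(data)
    return [[data[p], data[p + n // 2]] for p in range(n // 2)]
```

For D = 8, DBR and WRP hold the same pairs of items but on different PEs, which suggests why a single network permutation is enough to convert one ordering into the other.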


In addition to the parallel signal processing algo- 
rithms listed above, five data alignment/data transfer 
algorithms were designed and used by the computation 
station: 


Load Data in DSQ Order 
Distribute Data 
DBR → WRP Network Data Transfer 
DSQ → DBR Network Data Transfer 
Network Data Broadcast 


The Load Data algorithm is used for the initial assign- 
ment of data to the PEs. The Distribute Data algo- 
rithm distributes data by copying the data points from 
one set of PEs to another preserving the data ordering. 

The two Network Transfer algorithms perform reallocation 
of the data to obtain the specified data ordering. 
(For the algorithm sequences considered, no other 
reallocations were needed.) A Network Data Broadcast is a 
network data transfer in which a single data item is 
transferred to each PE in an SIMD machine. 


VI. Operation and Performance 


A different series of parallel algorithms is executed 
by the computation station on the short data set and on 
the long data set. The algorithms performed on the two 
data sets constitute two synchronous algorithms that 
are run asynchronously with respect to each other. Fig. 
3 shows the assignment of principal algorithms to PEs 
and the algorithm processing times for one 12.8 ms seg- 
ment of speech. The architecture consists of 512 PEs. 
PEs 0 through 255 are assigned the algorithms which 
are performed on the short data set. PEs 256 through 
511 are assigned the algorithms which are performed on 
the long data set. 


Assume that the short data set is available at time 
0.0. At that time, the 256 data samples are loaded into 
PEs 0 through 255. These data samples are then 
transferred to the remaining PEs by using the intercon- 
nection network. The calculations on the short and 
long data sets proceed asynchronously from this point. 
The partitioning of the Acoustic Processing Computa- 
tion Station’s PEs to perform all of the above algo- 
rithms can be easily seen in Fig. 3. 

A summary of processing times and the number of 
PEs utilized for each parallel algorithm is given in 
Table 3. When completion of an algorithm results in 
the generation of a characteristic parameter, that 
parameter is indicated after the algorithm name. The 
parallel algorithms which make up the longest synchronous 
path of each of the algorithm sequences are listed 
in order. Their processing times are tabulated and are 
indicated in Table 3 as s1, s2, s3, and l1, corresponding 
to the processing times for subsequences 1, 2, and 3 of 
the short data set and the subsequence on the long data 
set. These times are also shown on Fig. 3. The processing 
time for the computation station will be that of the slowest 
of the asynchronous sequences. The processing time of 
the computation station is 12.499 ms, resulting from the 
processing of the short data set. Since all processing is 
completed before the arrival of the next speech data set 
(< 12.8 ms), data flow through the architecture is 
continuous with no bottlenecks, providing real-time 
operation with a constant delay of 12.499 ms. The point at 
which processing is completed is indicated in Fig. 3. 


Fig. 3. Assignment of tasks to PEs and algorithm processing times for one 12.8 ms segment of speech. 
(Labels indicate the principal processing steps.) 


Table 3. Summary of the processing times and the 
number of PEs used for each algorithm. 

PARALLEL ALGORITHM                          # PEs   TIME (ms) 

Short Data Set Algorithms 
 Subsequence One 
  Load Data in DSQ Order                      256     0.205 
  Distribute Data                             512     0.006 
  Hamming Window                              256     0.028 
  Distribute Data                             256     0.006 
  Preemphasis                                 256     0.049 
  Autocorrelation Calc. - 256 point 
    Data Zero Padding                         256     0.001 
    Radix 2 DIF FFT                           256     1.857 
    Squared Magnitude Calc.                   256     0.085 
    Radix 2 DIT FFT                           256     1.857 
  Distribute Data                             128     0.013 
  Network Data Broadcast                       32     0.160 
  LPC Coefficients Calc. (A1-A24)              32     5.952 
  LPC Minimum Error Calc. (ERN)                32     0.182 
  Digital Inverse Filter 
    Data Zero Padding                         128     0.001 
    Radix 2 DIT FFT                           128     1.650 
    Squared Magnitude Calc.                   128     0.085 
    DBR → WRP Network Transfer                128     0.013 
  Minima Calc.                                128     0.259 
  Formant Frequency Analysis (F1,F2,F3)        64     0.090 
  Total (s1)                                         12.499 
 Subsequence Two 
  Time after Autocorrelation - 256 point              4.094 
  RMS Energy Calc. (RMS)                        2     0.174 
  Norm. Autocorrelation Calc. (AC)              2     0.284 
  Total (s2)                                          4.552 
 Subsequence Three 
  Time after Autocorrelation - 256 point              4.094 
  Zero Crossing Calc. (ZC)                    128     0.046 
  DSQ → DBR Network Transfer                  128     0.013 
  Radix 2 DIT FFT                             128     1.650 
  Energy Band Calc. (EH,EL,EM,ET,EVL)         128     0.327 
  Total (s3)                                          6.130 
Long Data Set Algorithms 
  Load Data in DSQ Order                      256     0.205 
  Distribute Data                             512     0.006 
  Maximum Calc.                               256     0.261 
  Center Clipped Signal Construction          256     0.042 
  Partial Autocorrelation Calc.               256    11.714 
  Pitch Extraction (F0)                       256     0.138 
  Total (l1)                                         12.366 
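The path totals quoted from Table 3 can be re-derived by summing the per-algorithm times. The sketch below works in integer microseconds to avoid rounding noise; the list names are illustrative, and the split of the short data set into a shared prefix (through the 256 point autocorrelation) plus three tails is read off the table.

```python
# Times from Table 3, in microseconds.
SHORT_COMMON_US = [205, 6, 28, 6, 49, 1, 1857, 85, 1857]  # through autocorrelation
SUBSEQ_ONE_TAIL_US = [13, 160, 5952, 182, 1, 1650, 85, 13, 259, 90]
SUBSEQ_TWO_TAIL_US = [174, 284]
SUBSEQ_THREE_TAIL_US = [46, 13, 1650, 327]
LONG_US = [205, 6, 261, 42, 11714, 138]
FRAME_US = 12_800  # a new frame arrives every 12.8 ms


def path_totals():
    """Total time of each synchronous path, in microseconds."""
    common = sum(SHORT_COMMON_US)
    return {
        "s1": common + sum(SUBSEQ_ONE_TAIL_US),
        "s2": common + sum(SUBSEQ_TWO_TAIL_US),
        "s3": common + sum(SUBSEQ_THREE_TAIL_US),
        "l1": sum(LONG_US),
    }


def station_delay_us():
    """The station finishes when its slowest path finishes."""
    return max(path_totals().values())
```

With these figures the slowest path is s1 at 12 499 µs, which leaves about 0.3 ms of slack before the next frame arrives.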


VII. Discussion 


This work focuses on the problem of using an 
MSIMD system to perform a large number of algorithms 
under real-time constraints. Major issues addressed 
include choice of partition sizes for the component algo- 
rithms, determination of the overall machine size, data 
allocation and alignment at the junctures between algo- 
rithms, and interconnections between the algorithms. 
The design resulted in a 512 processing element archi- 
tecture in which data flow is continuous with no 
bottlenecks. Real-time performance is achieved with a 
constant delay of 12.499 ms. 


At any point during processing, there may be from 
one to four independent SIMD algorithms being exe- 
cuted. The component SIMD machines range in size 
from 2 to 512 PEs. Nine different system configurations 
(i.e., partitionings) are used. These are accomplished 
dynamically, by reassignment of the control units to 
subsets of the PEs. A very rough measure of processor 
utilization can be determined by a ratio of the areas 
during which processors are performing algorithms and 
the total area of a single 12.8 ms frame. This calculation 
results in a processor utilization of about 75 percent. 


The required processing speed to achieve real-time 
performance was obtained by increasing the number of 
PEs executing the parallel algorithms. The ability to 
achieve greater speed by increasing the number of 
processors is characteristic of the problem domain of 
acoustic processing. The flexibility of the MSIMD 
architecture presented is particularly well suited to these 
types of problems. 


Many of the algorithms could be executed in less 
time than indicated in Table 3 if more PEs were used. 
However, this would have delayed other parallel 
algorithms, and real-time performance might not have been 
achieved. Other algorithms could not be run any faster 
because the maximum number of processors that can be 
employed in the algorithm is being used. For example, 
the LPC Coefficients Calculation algorithm can use only 
32 PEs. Even though more processors are available 
(Fig. 3), using more PEs will not decrease the processing 
time of the algorithm. The SIMD machine partition 
sizes were chosen in an attempt to create the smallest 
possible overall machine to perform all of the tasks 
within the real-time constraints. 


The types of algorithms to be performed and the 
real-time constraints placed on the system design 
resulted in very constrained algorithm scheduling. In 
order to meet the real-time requirements, each algo- 
rithm must be run on the SIMD machine of the size and 
in the order indicated in Fig. 3. This restriction on the 
size of an SIMD sub-machine indicates a requirement 
that the operating system must acknowledge requests 
for an SIMD machine of a specific size. An open prob- 
lem is whether or not this type of highly constrained 
scheduling could be done efficiently by an automatic 
scheduling algorithm. 


Substantial speed was obtained by utilizing the 
small number of well defined data orderings at the 
interaction points between parallel algorithms. This 
eliminated the need for the architecture to store and 
load data items between algorithms. Distribution and 
reallocation of data required only 0.2 percent of the 
total processing time of the computation station. The 
sequencing of the algorithms was chosen to minimize 
this reallocation time. In most cases, the parallel 
algorithms were designed such that no data alignment was 
necessary. For example, the 256 point Autocorrelation 
Calculation is performed by executing a sequence of four 
algorithms. The first algorithm, Data Zero Padding, 
accepts data in SSQ order and outputs data in WRP 
order. The Radix 2 DIF FFT algorithm uses the WRP 
ordered data and outputs data in DBR order. The 
Squared Magnitude Calculation preserves the data in 
DBR order, which is then used by the Radix 2 DIT 
FFT. The last algorithm outputs data in WRP order. 
For this sequence of algorithms, no data allocation 
was needed beyond that provided by the algorithms 
in the sequence. Because of the speed increases which 
can be gained by avoiding frequent data reallocations, 
an intelligent scheduler should make use of data 
allocation information in sequencing the algorithms and in 
selecting among alternate versions of a given algorithm. 


The work presented in this paper points the way to 
many directions for future work. Additional parallel 
algorithms could be explored. Since there are some idle 
processors during portions of the speech frame analysis, 
additional characteristic parameters could be calculated. 
The addition of floating point hardware to augment the 
instruction set of the MC68000 should be explored. All 
of the algorithms used in this work were deterministic 
and therefore had predictable processing times. Some 
signal processing operations, such as spectrum enhance- 
ment [15], have processing times that may vary depend- 
ing upon the input speech data. Design of an architec- 
ture using these algorithms would require probabilistic 
modeling and computer simulation studies. 


An issue which must be considered in the design of 
large scale systems such as the one presented here is the 
extent to which one wishes to employ special purpose 
hardware. Since speech processing is an evolving 
research area, it is desirable to have a flexible system on 
which new algorithms can be tested. This architecture 
provides such a research tool in which the amount of 
parallelism provided can be varied to execute a wide 
variety of algorithms. The design presented in this 
paper demonstrates the feasibility of an MSIMD system 
to perform speech acoustic processing within real-time 
constraints. 


Abstract -- A new cryptosystem based on multiresidue 
codes and on pseudorandom number generation is proposed in 
this paper. Parallel and pipelined computations in 
encryption and decryption can be realized by adopting 
multiresidue codes and the mixed-radix conversion scheme. The 
difficulty of cryptoanalysis in multiresidue codes is discussed 
in detail. A cipher unit for encrypting a 24-bit data block and 
for decrypting a 32-bit data block has been implemented by 
employing a low-cost microprocessor. The implementation of 
the unit is described in this paper. 


Introduction 


Hiding information in secret codes has been spreading 
in communication systems among computers, terminals, or both 
of these. Since the Data Encryption Standard (DES) and 
several new cryptosystems have been presented recently 
[1][2], we have entered a cryptograph age. An ideal 
cryptosystem makes both data encryption and decryption easy 
and inexpensive, and makes breaking its cipher hard. 

A new encryption system based on mixing multiresidue 
codes with a technique in pseudorandom number generation is 
presented in this paper. The proposed encryption system is 
a conventional cryptosystem. The cryptoanalysis is 
extremely exhaustive. The cryptosystem with simple data 
encryption and decryption has been implemented on a low cost 
microprocessor. The number of moduli (n) and the values of 
the moduli (m1,m2,...,mn) correspond to keys in the 
multiresidue system. 

The multiresidue system has to satisfy 


2^k < LCM(m1,m2,...,mn), 


where k is the length of a data block to be encrypted. 

The cryptosystem employing pseudorandom number 
generation is based on a block chaining scheme. In the block 
chaining scheme, the present data to be encrypted are 
influenced by other data previously encrypted. In the proposed 
system, information data in the present state to be 
encrypted are affected by both information of the 
multiresidue code in the previous state and the related 
pseudorandom number. The pseudorandom number is also influenced by 
information of the multiresidue code in the previous state. 

The block chaining scheme has an inevitable drawback. 
If even a single encrypted data block is dropped from the 
transmission line or is not received by the decryption 
system, it would be very difficult or almost impossible to 
decrypt the succeeding encrypted data blocks. 

The implemented cryptosystem employs 24-bit data block 
encryption and 32-bit data block decryption with six moduli. 
The length of one block to be encrypted can be easily 
expanded. 

The data to be encrypted can be converted modulus by 
modulus in parallel. The data decryption can be pipelined 
adopting the mixed-radix conversion method [3]. 

In order to break the implemented cipher, a Markov 
model of a complete graph consisting of 2^24 nodes and 
2^24(2^24-1)/2 arcs has to be solved. Moreover, the difficulty of 
the cryptoanalysis can be enhanced by increasing the period 
of the pseudorandom number sequence. 

The proposed encryption and decryption procedures are 
described in Fig.1 and Fig.2, respectively. 


Conversion from normal numbers to multiresidue codes 


The multiresidue system is composed of multiple moduli 
m1,m2,...,mn which give the usable number range 
M=LCM(m1,m2,...,mn). Let a normal number X be represented 
in a residue form. The residue representation of |X|M is 
then, 


|X|M ==> {x1,x2,...,xn}, 


where xi=|X|mi means that xi is the ith residue of X modulo 
mi. 


* If 0<a<m and |ab|m=1 are satisfied, a is called the 
multiplicative inverse of b mod m, and is denoted by 
a=|1/b|m. 


Example 1: For the moduli m1=6,m2=7,m3=11, the usable 
number range is then 


0 <= X < M = LCM(6,7,11) = 462. 


When X is 26, the multiresidue representation of |X|M 
is 


|26|462 ==> {2,5,4}. 
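Example 1 can be reproduced with a few lines of Python. This is a minimal sketch of the conversion only; the function names are illustrative and the range check reflects the usable range M = LCM(m1,...,mn) stated in the text.

```python
from functools import reduce
from math import gcd


def usable_range(moduli):
    """M = LCM(m1, ..., mn), the number of distinct representable values."""
    return reduce(lambda a, b: a * b // gcd(a, b), moduli, 1)


def to_residues(x, moduli):
    """Residue representation {x1, ..., xn} with xi = x mod mi."""
    if not 0 <= x < usable_range(moduli):
        raise ValueError("x is outside the usable number range")
    return [x % m for m in moduli]
```

Each residue depends only on X and its own modulus, which is why the n residues can be computed by n processors at the same time.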


A high speed parallel residue computation algorithm is 
proposed in this paper. When 2^k±1 (k=1,2,...) are adopted 
as moduli, every residue can be calculated in parallel 
as shown in Fig.3. The residues of an n-bit data block 
mod 2^k-1 can be calculated in parallel on every k-bit block. 
This can be proved using Eq.(1) [3] as follows: 


|a+b+c+....|M = ||a|M+|b|M+|c|M+....|M     (1) 


Proof : Consider X mod M. Let X be an n-bit data and 
M be 2^k-1 (k=1,2,....). Since M=2^k-1=0 mod M, we have 
2^k=1 mod M. Writing X in k-bit blocks a0, a1, a2, ...., 


|X|M = |a0+2^k a1+2^2k a2+....|M = |a0+a1+a2+....|M. 


Q.E.D. 


The residues of an n-bit data block mod 2^k+1 can be 
calculated on every 2k-bit block in the same manner. 

The proposed parallel computation algorithm is suitable 
not only for multimicroprocessor implementation, but also 
for iterative VLSI implementation. 
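The block-summing idea behind Eq.(1) can be sketched as follows. The code is serial and only illustrates the arithmetic: in a parallel realization each block would be handled by its own unit. Function names are illustrative, and the final `%` in the 2^k+1 case is an ordinary reduction applied to the (small) block sum.

```python
def residue_mod_2k_minus_1(x, k):
    """x mod (2^k - 1), computed by summing the k-bit blocks of x."""
    m = (1 << k) - 1
    total = 0
    while x:
        total += x & m          # add the lowest k-bit block
        x >>= k
    while total > m:            # fold carries back in; 2^k = 1 mod m
        total = (total & m) + (total >> k)
    return 0 if total == m else total


def residue_mod_2k_plus_1(x, k):
    """x mod (2^k + 1), computed on 2k-bit blocks since 2^(2k) = 1 mod (2^k + 1)."""
    m = (1 << k) + 1
    mask = (1 << (2 * k)) - 1
    total = 0
    while x:
        total += x & mask       # add the lowest 2k-bit block
        x >>= 2 * k
    return total % m
```

For example, 26 is 011 010 in 3-bit blocks, and 3 + 2 = 5 agrees with 26 mod 7.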


Conversion to the Mixed-Radix system 


There are two schemes of conversion from the residue 
system to the normal number system. One is based on the 
Chinese Remainder Theorem, while the other is the Mixed-Radix 
Conversion [3]. We have adopted the latter conversion from 
the viewpoints of the parallelism and the pipeline 
processing capability involved. 

The mixed-radix expression is of the form 


X = an·(m1·m2·....·m(n-1)) + .... + a3·(m1·m2) + a2·(m1) + a1     (2) 


where ai is the ith mixed-radix coefficient. The ai 
(i=1,2,....,n) are required to obtain the normal number 
representation as shown in Example 2. 


Example 2 : For m1=7, m2=3, and m3=5, find the 
associated mixed-radix digits of {4,0,3}, where the mixed 
radix expression is 


X=a3·(7·3)+a2·(7)+a1.     (3) 


Solution : 

moduli:                        m1=7   m2=3   m3=5 

Residue representation of X:     4      0      3     a1=4 
Subtract a1:                            2      4 
Multiply by |1/7|mi:                    1      3 
(X-a1)/7:                               2      2     a2=2 
Subtract a2:                                   0 
Multiply by |1/3|m3:                           2 
((X-a1)/7-a2)/3:                               0     a3=0 

(Segment 1 is the m2 column up to a2; Segments 2 and 3 are 
the first and second steps of the m3 column.) 


Then the mixed radix representation of X is 
{0,2,4}. Hence by Eq.(3), one obtains 


X=0·(7·3)+2·(7)+4=18. 
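The subtract-and-multiply-by-inverse procedure of Example 2 can be sketched in Python. The sketch assumes pairwise coprime moduli (so that every inverse exists) and Python 3.8 or later for `pow(m, -1, mod)`; the function names are illustrative, and digits are returned in the order a1, a2, ..., an.

```python
def mixed_radix_digits(residues, moduli):
    """Mixed-radix coefficients [a1, ..., an] of the residue representation."""
    work = list(residues)
    digits = []
    for i, m_i in enumerate(moduli):
        a_i = work[i]
        digits.append(a_i)
        # Subtract a_i and multiply by |1/m_i| in every remaining modulus.
        for j in range(i + 1, len(moduli)):
            m_j = moduli[j]
            inverse = pow(m_i, -1, m_j)   # multiplicative inverse of m_i mod m_j
            work[j] = ((work[j] - a_i) * inverse) % m_j
    return digits


def from_mixed_radix(digits, moduli):
    """Evaluate X = a1 + a2*m1 + a3*m1*m2 + ..."""
    x, weight = 0, 1
    for a, m in zip(digits, moduli):
        x += a * weight
        weight *= m
    return x
```

The inner loop touches each remaining modulus independently, which is the property that lets the per-modulus steps be assigned to separate pipeline stages.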


Parallelism and pipelining 


Assuming that t processors are employed for converting 
the normal data to t residues, the ith residue of X mod mi 
is computed by the ith processor, where X is the normal data 
and mi is the ith modulus. The maximum throughput of the 
multiprocessor system depends on the slowest residue 
computation. The experimental encryption program, which was 
designed to compute each of the six residues sequentially 
on a single processor, can be divided into six independent 
program modules. If the modules are processed in 
parallel on six processors, approximately five times 
the throughput is expected to be achieved, allowing for some 
waiting overhead for synchronization among the six 
processors. 

On the other hand, the data decryption adopting the 
mixed-radix conversion scheme can be mapped onto the 
parallel and pipelined multiprocessor architecture as shown in 
Fig.4. Fig.4 describes a three-residue decryption system for 
Example 2 in the previous section. Processor P1 is a queue 
which transmits the residues x1, x2, and x3. Processor P2 
performs the calculation of segment 1 to obtain the 
mixed-radix coefficient a2 shown in the solution of Example 
2. Similarly, P3 and P4 are for segment 2 and for 
segment 3, respectively. Processor Q performs the 
conversion from the mixed-radix system to the normal number system 
following Eq.(3). 

If each queue of every processor provides an adequate 
length, the system requires no centralized synchronization 
at all, because the scheme is based on a data flow 
mechanism. When all the required data arrive at the queues of a 
processor, the computation is fired and is autonomously 
performed in the processor to provide the computed result for 
the succeeding processors. In order to convert t residues 
into the normal data, N processors need to be provided in the 
system, where N is t(t-1)/2 + 2. Since in the decryption 
system the processor Q shown in Fig.4 for the conversion 
from the mixed-radix system to the normal number system has 
the heaviest load in the computation, the unit time of the 
pipelined processing is determined by the computing capability of 
the processor Q. 

The experimental decryption program for sequentially 
converting the six residues to the normal data can be 
divided into seventeen independent program modules (t=6, 
N=17). If the modules are processed in parallel on 
seventeen processors, it is estimated from the experimental 
decryption program that the unit time diminishes to 
approximately one twelfth of the time required for 
decryption on a single processor. 


Pseudorandom number 

The pseudorandom number generated by a linear 
recurrence modulo 2 "shift register" [1] is utilized as a 
key of the Caesar Cipher [1] in the proposed system. 

When a trinomial of the form x^p+x^q+1 whose degree is a 
Mersenne exponent is adopted as a generator polynomial, the 
period of the linear recurring sequence becomes 2^p-1 [1]. 
On the other hand, the period of a primitive polynomial 
becomes 2^p-1, where p is the degree of the polynomial [1]. 


Combinations of various generator polynomials can be chosen 
to generate pseudorandom numbers. 

Let a polynomial H(x) be G1(x)^m1 G2(x)^m2 .... Gt(x)^mt. The 
period of the linear recurring sequence becomes 
LCM(n1,n2,....,nt)·2^(a+1), where ni (i=1,2,....,t) are the 
periods of the Gi(x) and 2^(a+1) has to satisfy the 
following [4]: 

2^a < Max(m1, m2, ...., mt) <= 2^(a+1). 


Example 3 : Find the period of H(x)=(x^3+x+1)^2 (x+1)^3. 


Solution : G1(x)=x^3+x+1      G2(x)=x+1 
           m1=2               m2=3 
           n1=2^3-1=7         n2=2^1-1=1 
Hence, the period of H(x) becomes 

LCM(7,1)·2^(a+1) = 7·4 = 28     {2^1 < Max(2,3) <= 2^2}. 
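Example 3 can be cross-checked numerically. The sketch below represents polynomials over GF(2) as Python integers (bit i is the coefficient of x^i) and finds the period by brute force as the smallest e with x^e = 1 mod H(x); it also evaluates the closed-form expression from the text. The helper names are illustrative, and the brute-force search assumes H(x) has a nonzero constant term.

```python
from functools import reduce
from math import gcd


def gf2_mul(a, b):
    """Product of two GF(2) polynomials (carry-less multiplication)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf2_mod(a, m):
    """Remainder of a divided by m over GF(2)."""
    dm = m.bit_length()
    while a.bit_length() >= dm:
        a ^= m << (a.bit_length() - dm)
    return a


def gf2_pow(a, e):
    """a raised to the e-th power over GF(2), without reduction."""
    result = 1
    for _ in range(e):
        result = gf2_mul(result, a)
    return result


def period_brute_force(h):
    """Smallest e > 0 with x^e = 1 mod h (h must have constant term 1)."""
    v = gf2_mod(0b10, h)
    e = 1
    while v != 1:
        v = gf2_mod(v << 1, h)
        e += 1
    return e


def period_formula(periods, multiplicities):
    """LCM(n1..nt) * 2^(a+1) with 2^a < max(mi) <= 2^(a+1)."""
    lcm = reduce(lambda a, b: a * b // gcd(a, b), periods, 1)
    return lcm * (1 << (max(multiplicities) - 1).bit_length())
```

Here `0b1011` stands for x^3+x+1 and `0b11` for x+1, so H(x) is `gf2_mul(gf2_pow(0b1011, 2), gf2_pow(0b11, 3))`.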
The length of the generated pseudorandom number 
sequence corresponds to the key length of the Caesar Cipher. 
The number of shifted clocks in a state is influenced by 
multiresidue codes in the previous state. If a trinomial of 
the form x^p+x^q+1 whose degree is a Mersenne exponent is 
chosen as a generator polynomial, the generator requires a 
p-bit initial seed (not all zero) and the total number of 
the seeds becomes 2^p-1 [1]. 


Cipher unit 


A cipher unit for realizing the proposed encryption and 
decryption has been implemented by the use of a 
microprocessor Z-80A as shown in Fig.5. RS232C, HDLC, and SDLC 
interfaces are realized in the unit employing an SIO chip for the 
Z-80 family. The encryption time and the decryption time 
are 1 ms and 2 ms, respectively. The encryption and the 
decryption programs are stored in a 2k-byte ROM. The number 
of moduli and the values of the moduli can be changed by 
manipulating DIP switches or by replacing the ROM with 
another one. The cost of the experimental unit was 
approximately 70 dollars. 


Strength of multiresidue codes without 
pseudorandom number generation 


It is assumed in this section that abundant encrypted 
data blocks are sampled by a cryptoanalyst and that the 
length of a data block is known. Consider the number of 
blocks that must be sampled. In order to cryptoanalyze 
the multiresidue code, the bias in 
the probability of the occurrence of 1's in the bit sequence 
of a residue can be utilized. 

When the modulus m is even, the probability of 
occurrence of 1's in a residue is estimated to be 1/2. When 
the modulus m is odd, the probability pi of the occurrence 
of 1's in the ith bit of a residue is estimated to be r/m, 
where r is a positive integer. The pi satisfy the following: 


(m-2^k)/m=pk< .... <pi< .... <p1=p0=(m-1)/2m     (4) 


where p0 is the probability of the occurrence of 1's in the 
least significant bit of the residue and pk is that in the 
most significant bit. The equation m=1/(1-2p0)=1/(1-2p1) 
follows from Eq.(4). 
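The bit probabilities in Eq.(4) can be tabulated exactly for a small modulus. The sketch assumes residues are uniformly distributed over 0..m-1, which is the implicit assumption behind the r/m estimate; the function names are illustrative.

```python
from fractions import Fraction


def bit_probabilities(m):
    """Exact probability that bit i of a uniform residue in 0..m-1 is 1."""
    width = (m - 1).bit_length()
    return [Fraction(sum((r >> i) & 1 for r in range(m)), m)
            for i in range(width)]


def modulus_from_p0(p0):
    """Recover an odd modulus from the least-significant-bit probability."""
    return 1 / (1 - 2 * p0)
```

For m = 11 this gives 5/11, 5/11, 4/11, 3/11 for bits 0 through 3, and `modulus_from_p0(Fraction(5, 11))` returns 11; for an even modulus the least significant bit is 1 with probability exactly 1/2.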


Let pO be the statistical probability of pO. In order 
to investigate whether an odd modulus m is employed or not, 
it is sufficient to examin whether the inequality m-1 < 
1/(1-2p0) < m+1 is satisfied or not . Therefore, 

1/2-1/2(m-1) < po < 1/2-1/2(m+1), 
1/2-p0-1/2(m-1) < p0-pO < 1/2-p0-1/2(m+1), 
1/2m-1/2(m-1) < pO-pO < 1/2m-1/2(m+1), 
~1/2m(m-1) < DpO=-poO < 1/2m(m+1), 

Por po=n( og, (m*~1)/4nm*), 


where N means the normal distribution and n is the number of 
required blocks to be sampled. 


In order to estimate the modulus m with a reliability of
99%,

$\frac{3}{2} \cdot \frac{1}{\sqrt{n}} \sqrt{\frac{m^2-1}{m^2}} < \frac{1}{2m(m+1)}$,
$9m^2(m+1)^2 \frac{m^2-1}{m^2} < n$,
$9(m+1)^3(m-1) < 9(m_e+1)^4 < n$

should be satisfied, where $m_e$ is the estimated maximum
modulus. For example, if $m_e = 127$ then $n \approx 2^{31}$.
When the modulus is even, $9(m_e+1)^2 < n$ is obtained in
the same manner.
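
The bound on the number of sampled blocks is easy to evaluate numerically. The sketch below simply evaluates the two expressions given in the text; the function name `required_blocks` and its `odd` flag are illustrative assumptions.

```python
import math

def required_blocks(max_modulus, odd=True):
    """Lower bound on n from the text: 9(m_e+1)^4 for odd moduli, 9(m_e+1)^2 for even."""
    exponent = 4 if odd else 2
    return 9 * (max_modulus + 1) ** exponent

n = required_blocks(127)
print(n, math.log2(n))   # n and its base-2 logarithm for m_e = 127
```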


The strength of multiresidue codes without pseudorandom
number generation is determined by the number of blocks
that must be sampled. This number depends upon the maximum
modulus. The larger the modulus chosen, the longer the
block to be encrypted must be.


Conclusion 


A new cryptosystem based on mixing multiresidue codes
with a pseudorandom number generation technique is
proposed. The cryptosystem has been implemented in a
low-cost cipher unit using a microprocessor. If the
parallelism and pipelining inherent in the multiresidue
system are mapped onto multimicroprocessor systems
or onto iterative VLSI systems, encryption and decryption
at much higher speed could be achieved. It is shown
that the difficulty of cryptanalysis of multiresidue codes
depends upon the maximum modulus. It is expected that the
proposed low-cost cryptosystem will contribute to the
spread of communication in which information is hidden in
secret codes.


ABSTRACT


This paper presents a generalized
architecture for a parallel/pipeline processor
capable of performing exponentiation (raising
some base $\alpha$ to a power x in a finite field) in
$O(\log_2(\log_2 x))$ time. This device has
applications in coding and public key data
encryption.


INTRODUCTION 


In this paper we present a highly efficient 
parallel processor architecture for finite field 
exponentiation. 

Knuth gave a good technique for
exponentiation with a single processor
architecture having time complexity of
$2(\log_2 x)t$, where t is the time delay of
multiplying (or multiplying/reducing in a finite
field) two N-bit numbers [3].

Knuth's algorithm was applied to a two
processor device and the complexity of the
procedure was subsequently reduced to
$(\log_2 x)t$, shown to be optimal for any parallel
architecture [1]. This bound was based on the
number of squaring operations required in the
worst case. We will consider a parallel
architecture which performs finite field
exponentiation in $O(\log_2(\log_2 x))t$ time.

This device has applications to coding and
public key data encryption. In some
cryptographic systems the generation of some
public key (Y) from some secret key (x) consists
of raising some known base $\alpha$ to the power x in
some GF($2^N$) [2], [4].


MATHEMATICAL BASIS OF EXPONENTIATION


Assume we are performing exponentiation in
GF($2^N$).

We can say that x, the desired exponent, can
be represented by some N-bit vector, b:

$b = (b_{N-1}, b_{N-2}, \ldots, b_1, b_0)$

Knowing that $\alpha^{y+z} = (\alpha^y)(\alpha^z)$ we can
express $\alpha^x$ as

$\alpha^x = \prod_{i=0}^{N-1} \alpha^{2^i b_i}$

If the powers $\alpha^{2^i}$ (i = 0, 1, ..., N-1)
could be provided to a number of
multiplication units operating in parallel
(multiplication being commutative in Galois
fields) the time required to produce the desired
power of the base $\alpha$ could be greatly reduced.
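
The product form above can be illustrated in software. The following is a minimal sketch, assuming GF(2^8) with the reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B); the field size, the polynomial and the function names are assumptions chosen for the example and are not taken from the paper.

```python
def gf_mul(a, b, poly=0x11B, n=8):
    """Multiply a and b in GF(2^n), reducing by the polynomial `poly` (assumed)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1
        if a >> n:               # degree reached n: reduce modulo poly
            a ^= poly
    return result

def gf_pow_by_squares(alpha, x, poly=0x11B, n=8):
    """alpha**x as the product of the alpha^(2^i) for which bit b_i of x is 1."""
    squares = [alpha]            # squares[i] = alpha^(2^i)
    for _ in range(n - 1):
        squares.append(gf_mul(squares[-1], squares[-1], poly, n))
    result = 1
    for i, sq in enumerate(squares):
        if (x >> i) & 1:         # bit b_i selects alpha^(2^i)
            result = gf_mul(result, sq, poly, n)
    return result

print(hex(gf_pow_by_squares(0x03, 0b10110101)))
```

The selected factors are independent of one another, which is what allows them to be multiplied in parallel rather than in the sequential loop used here.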


ARCHITECTURE CONCEPT 


The structure of the processor to be discussed
is based on the concept presented above.


The processor architecture is
topologically similar to that of a binary tree.
A single multiplication element in our processor
would correspond to a node in the binary tree
while a line (parallel/serial) for the
unidirectional transfer of information within the
processor would in turn correspond to an arc in
the tree.

Now consider the procedure by which such a
machine might compute the product of $2^n$
numbers $(m_1, m_2, \ldots, m_{2^n})$. The set of
numbers would first be partitioned into pairs
and assigned among the $2^{n-1}$ multiplication
elements (ME) at level n.

The processors at level n would each
multiply the two numbers assigned to them and
pass the product to their "father" at level n-1.
The "fathers" at level n-1 would then multiply
the values in their registers and pass the
product to their "fathers" and so on. The process
would continue until the root processor had the
value

$\prod_{i=1}^{2^{n-1}} m_i$ in one register and
$\prod_{i=2^{n-1}+1}^{2^n} m_i$ in the other.

Multiplying the contents of its two registers it
would produce $\prod_{i=1}^{2^n} m_i$, the desired result.
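
The level-by-level procedure can be sketched in a few lines. This is only a sequential illustration of the tree reduction, with the hypothetical name `tree_product`; in the processor, all multiplications on one level would take place at the same time.

```python
from operator import mul

def tree_product(values, multiply=mul):
    """Multiply 2^n values pairwise, level by level; return (product, levels used)."""
    count = len(values)
    assert count > 1 and (count & (count - 1)) == 0, "need 2^n values, n >= 1"
    level, depth = list(values), 0
    while len(level) > 1:
        # every multiplication element on this level would work in parallel
        level = [multiply(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth

print(tree_product([1, 2, 3, 4, 5, 6, 7, 8]))   # product of 2^3 numbers, 3 levels
```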


If we had half the number of multiplication
elements (an n-1 level architecture) we could
partition the $2^n$ numbers by assigning two
pairs of values to each multiplication element
on the lowest level.

Each bottom level multiplication element
would proceed by multiplying its first pair of
numbers and then its second pair, which would
pipeline the products of $(m_1, \ldots, m_{2^{n-1}})$
through to the root first, followed by the
products of $(m_{2^{n-1}+1}, \ldots, m_{2^n})$.
The topmost element (an accumulating
multiplication element above the root of the
tree structure) would then store the value of
the product of the first $2^{n-1}$ numbers until it
receives the product of the second $2^{n-1}$
numbers. It would then multiply the two to
produce the desired product. The delay in
achieving the result in this design would be
that of n+1 multiplications as opposed to a
delay of n multiplications in the example with n
levels. Clearly the processor could have any
number of levels, with a larger number of levels
giving a shorter time delay in achieving the
product.

To compute $\alpha^x$ we would, rather than
multiplying $2^n$ arbitrary numbers as in our
example, multiply the precomputed powers of
$\alpha$ selectively, based on the binary
representation of the exponent x, to obtain $\alpha^x$.

A device was recently developed which
produces all necessary powers $\alpha^{2^i}$ in
time (1)t [5]. In combination the two devices
can produce any $\alpha^x$ in GF($2^N$) very
efficiently.


GENERAL ARCHITECTURE AND ALGORITHM 


In this section we will present the design
of a general (J level) processor for the
exponentiation problem under consideration as
well as the formal version of the processing
algorithm. It is important to note that the
majority of the decision making specified in the
algorithm will be implemented through the
hardware in the multiplication elements.
Some branching that is specified in the
algorithm, namely that which requires certain
steps of the algorithm to be skipped in the
very early and very late iterations, will be
controlled by a comparator/counter which will
control clock inputs to the various levels of
the structure.


In an efficient implementation, x would
probably be placed in a shift register with
appropriate connections to the $2^{J-1}$
multiplication elements on level J and shifted
$2^J$ bits at each iteration, providing the
control for that wave of computation.

There are three distinct types of
multiplication elements in this architecture.
All have a multiplication/reduction unit (a
device capable of multiplying two N-bit numbers
and reducing the product in GF($2^N$) in time
delay t). They differ only in their decision
logic.

Prior to the computation of $\alpha^x$ the
powers of $\alpha$ would be distributed two to each ME
along the Jth level until each ME had two; while
these are being multiplied the next set of pairs
would be distributed. The process would continue
until all N of the $\alpha^{2^i}$ had been
distributed.

The multiplication elements at level J have
logic (hardware) to perform the decision
operations specified in the algorithm. The
binary representation of x can be considered to
be a control element for these devices. These
elements also have one register containing the
value 1.

The multiplication elements on levels 1
through J-1 have a multiplication/reduction unit
and two input registers A and B.

The multiplication element used to compute
the accumulated partial products and ultimately
produce the result has a logical feedback from
its output to its B register. It outputs the
result only when the counter/comparator so
directs it. Below is the general algorithm for
computing $\alpha^x$ in GF($2^N$).

Let us consider the general structure
of the processor.


[Figure 1: general structure of the processor, a binary tree
of multiplication elements with A and B registers, drawn
from Level 0 through Levels 1, 2, 3, ..., N-1, N]

Figure 1
ALGORITHM

0   I = 0

1 a) Level J multiplication elements ($ME_{J,i}$)
     for $i = 0, 1, 2, \ldots, 2^{J-1}-1$

     If $(b_{2i+(I)2^J}, b_{2i+1+(I)2^J})$ =

     (0,0) Move the value 1 from the register
           to $ME_{J-1,\lfloor i/2 \rfloor}$
     (1,0) Move the value $\alpha^{2^{2i+(I)2^J}}$
           to $ME_{J-1,\lfloor i/2 \rfloor}$
     (0,1) Move the value $\alpha^{2^{2i+1+(I)2^J}}$
           to $ME_{J-1,\lfloor i/2 \rfloor}$
     (1,1) Move the product of
           $\alpha^{2^{2i+(I)2^J}} \alpha^{2^{2i+1+(I)2^J}}$
           to $ME_{J-1,\lfloor i/2 \rfloor}$

  $b_1$) (level J-1 ME) ($ME_{J-1,i}$) for
     $i = 0, 1, \ldots, 2^{J-2}-1$

     If $I \geq 1$ multiply contents of A
     register and B register and
     move result to $ME_{J-2,\lfloor i/2 \rfloor}$

  $b_2$) (level J-2 ME) ($ME_{J-2,i}$) for
     $i = 0, \ldots, 2^{J-3}-1$

     If $I \geq 2$ multiply contents of A
     register and B register and
     move result to $ME_{J-3,\lfloor i/2 \rfloor}$
     .
     .
     .

  $b_{J-1}$) (level 1 ME) ($ME_{1,0}$)

     If $I \geq J-1$ multiply contents of A
     register and B register and
     move result to $ME_{0,0}$

  c) Level 0 ME ($ME_{0,0}$)

     If $I = J$ move contents of A
     register to B register.

     If $J+1 \leq I < \lceil N/2^J \rceil + J + 1$ multiply
     A register by B register and store
     product in B register.

     If $I = \lceil N/2^J \rceil + J + 1$ output product
     and STOP.

2   $I \rightarrow I+1$

     If $I < \lceil N/2^J \rceil$ go to 1 a)

     If $\lceil N/2^J \rceil \leq I < \lceil N/2^J \rceil + J$
     go to 1 $b_{1+(I-\lceil N/2^J \rceil)}$)

     If $I = \lceil N/2^J \rceil + J$ go to 1 c)

     go to 1
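
The flow of the algorithm can be illustrated by a sequential software sketch. It does not model the pipelining or the clock control: each wave of $2^J$ exponent bits is reduced through the tree and then accumulated, one wave after another. As an assumption made only to keep the example short, multiplication modulo the prime 65537 stands in for multiplication/reduction in GF($2^N$); the function name and parameters are hypothetical.

```python
from math import ceil

def wave_exponentiate(alpha, x, n_bits, levels, p=65537):
    """Compute alpha**x mod p in waves of 2**levels exponent bits.

    Returns (result, number of waves). The modulus p is a stand-in for the
    field arithmetic of the paper (assumption of this sketch).
    """
    # precomputed powers alpha^(2^i), i = 0..n_bits-1
    powers = [alpha % p]
    for _ in range(n_bits - 1):
        powers.append(powers[-1] * powers[-1] % p)

    width = 2 ** levels                     # exponent bits consumed per wave
    waves = ceil(n_bits / width)
    accumulator = 1                         # plays the role of the level-0 element
    for w in range(waves):
        # level J decision logic: pass alpha^(2^i) if bit b_i is 1, else the value 1
        operands = []
        for i in range(w * width, (w + 1) * width):
            bit = (x >> i) & 1 if i < n_bits else 0
            operands.append(powers[i] if bit else 1)
        # levels J down to 1: pairwise tree reduction of this wave
        while len(operands) > 1:
            operands = [operands[k] * operands[k + 1] % p
                        for k in range(0, len(operands), 2)]
        accumulator = accumulator * operands[0] % p
    return accumulator, waves

print(wave_exponentiate(3, 0b1011011101, n_bits=10, levels=2))
```

In the hardware, successive waves overlap in the tree, so the delay is governed by the number of waves plus the depth of the tree rather than by their product.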


ANALYSIS OF COMPLEXITY 


It was shown earlier that the number of
levels in the processing structure influences
the speed of the exponentiation procedure.

First we will examine the number of
multiplication elements in a given architecture
of J levels. By definition of the topology,
each level (k) of the processor has twice as
many multiplication elements as the previous
level (k-1). Hence the first level (k=1) has one
ME, the second has two, and the kth has $2^{k-1}$.
There are J levels plus one accumulating element,
thus a total of $(2^J - 1) + 1 = 2^J$ multiplication
elements in the design.

The time complexity of a J-level device can
be calculated as follows:

There are $2^{J-1}$ multiplication elements on
level J. If we wish to compute $\alpha^x$ in GF($2^N$)
and x is represented by an N-bit vector then
there are at most N values of $\alpha^{2^i}$ to be
multiplied.

If there are $2^{J-1}$ processors on level J
then each multiplies $\lceil N/2^{J-1} \rceil$ values. Thus
there are $\lceil N/2^J \rceil$ iteration levels in the
procedure, hence a time delay of $\lceil N/2^J \rceil t$.

We note that partial products are being
pipelined through the system to the
accumulating ME at the top level (level zero).
In order to output the result the accumulating
ME must receive the product of the last wave
of values. They must pass through J levels
(level J-1 through level 1) before reaching the
accumulating ME. This requires an additional
time of Jt. When $ME_{0,0}$ receives the
product it must multiply it by the partial
product in its B register and output the result,
taking time t; thus the total time delay of the
procedure is

$(\lceil N/2^J \rceil + J)\, t$
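
The trade-off between the number of multiplication elements and the delay can be tabulated directly from the expressions above. The sketch below only evaluates those expressions; the function names and the choice N = 127 are illustrative assumptions.

```python
from math import ceil

def me_count(levels):
    """Tree of `levels` levels (2^(k-1) elements on level k) plus one accumulator."""
    return sum(2 ** (k - 1) for k in range(1, levels + 1)) + 1

def delay_in_multiplications(n_bits, levels):
    """Total delay, in units of t, from the expression ceil(N / 2^J) + J."""
    return ceil(n_bits / 2 ** levels) + levels

N = 127   # exponent length in bits, chosen only for illustration
for J in range(1, 8):
    print(J, me_count(J), delay_in_multiplications(N, J))
```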


We stated previously that such a design
could yield a speed of $O(\log_2(\log_2 x))$.
This is true if $2^J \geq N$. This is the lower
bound on the computation of $\alpha^x$ in GF($2^N$)
using this device. In this case no accumulating
ME is needed, saving one delay.


Abstract -- A fully decentralized operating
system is specified for a single user multiple
computer environment. Most operating systems
have been designed around the architecture of a
specific machine. We propose to design the
operating system in a modular fashion, specifying the
needed hardware architecture as the operating
system evolves. Each module is assigned to its own
processing element and these communicate when
necessary via a message passing scheme. Process
swapping is not necessary since multi-processing
does not take place on the computing element
level. Although connected to a local network, most
work is done at the local user station and there
is no dynamic load balancing at the network level.
A distributed design leads to small and simple
operating system modules. By distributing
functions to independent processors, protection is
greatly simplified and the inherent concurrency
gained improves performance.


INTRODUCTION 


The decreasing size and cost of integrated 
circuitry suggests a new direction in the develop- 
ment of computer systems. It is quite reasonable 
to expect the current state of multiple users per 
processor to be totally reversed: one user 
will have an array of processors at his disposal. 
This paper specifies a multiple processor archi- 
tecture and its accompanying distributed operating 
system for a single user environment. 

To date operating systems have been designed 
around the architecture of a specific machine. In 
contrast, we propose to first design the operating 
system in a modular fashion. The overall archi- 
tecture of the desired machine will then evolve as 
the design of the operating system evolves and 
will be specified to meet its needs. This pro-
posal is based on the ever-present reality that
hardware is no longer an expensive resource, and 
need not place the traditional constraints on 
system design. The phrase "Island Universe" is 
used to suggest the extensive processing power 
available to an isolated, single user under this 
design. 


Network Levels 


We view network activity on three levels. 
On the large scale, there is a remote network 
which links geographically distant sites, pro- 
viding potentially global communication and spe- 
cialized services. The next level, the local net- 
work, is of the Ethernet variety (1), and operates 
in the niche traditionally occupied by a central- 
ized, time-sharing system. On this level are a 
variety of basically independent users operating 
out of their own stations, with occasional cooper- 
ation on specific tasks. In general, we feel a 
user should complete work based on resources at 
the user's station and not arbitrarily send work
out to other stations. This prevents local
performance degradation due to the load of others, as
well as providing a measure of protection between
users. However, we still recognize isolated
instances of network use beyond that of a mail
service, but the local resources of an individual
user are considered sacrosanct. Central to the
design of the "Island Universe" environment is the
third level of network, the user station itself.
The components of the distributed operating system
constitute a miniature network of cooperating
processes within the user's station and are the
topic of this paper.


THE DISTRIBUTED OPERATING SYSTEM 


We propose to partition the operating system 
into cooperating modules, similar in concept to 
task force utilities in Medusa (2), each with 
specific functions mapped onto physically separate 
hardware components. This greatly simplifies the 
overall complexity and protection needs. Each 
component has a small resident nucleus and a soft- 
ware process to perform the implied operating sys- 
tem function. A major function of each nucleus is 
concerned with message passing and is therefore 
reminiscent of other nuclei (3,4), but is even 
less complex than these examples. A diagram of 
the interconnections between these components is 
shown in Figure 1. Each hardware component con- 
sists primarily of a processor and memory to hold 
its assigned software process. Some modules have 
a natural association with certain devices such as 
the User Interface with the display terminal and 
the File System Manager with secondary storage. 
Each module is partitioned in such a way as to 
make its task quite simple. We propose to elimi- 
nate many of the complexities brought on by 
resource sharing. 


Software Modules 


There are two types of software modules, sys- 
tem and user processes. System processes are 
statically assigned to the User Interface(UI), Task 
Scheduler(TS), File System Manager(FS), Device Con- 
troller(DC), External Communication Controller (ECC) 
and the Inter-process Communication Manager(IPC). 
These processes are resident at all times and 
because of their limited task are limited in size. 
On the other hand, user processes are run in the 
Processor Array(PA), a collection of identical 
processors. Because of the inherent unpredicta- 
bility and varying needs of user processes, these 
processors need more memory and perhaps more 
speed. All processes cooperate on common tasks
via a message passing system. Messages between
system processes or between system and user pro-
cesses take place on the Service Bus while those
between user processes take place on the Inter-
process Communication Bus. The function of each
system process will now be briefly described.

The UI handles all interaction with the user 
and essentially acts as a command line interpreter 
and, possibly, an editor. The DC manages all use 
of devices not associated with the user display 
terminal or the on-line file system and caters to 
specific device idiosyncrasies. The ECC performs
all external networking functions for the station 
and, thus, is concerned with protocol as well as 
security matters. The FS manages access to the 
file system and maintains a local file cache in 
primary memory as well as the file system itself 
on secondary storage. The TS and the IPC both 
manage and support the needs of user processes 
running in the PA. The TS acts as a user process 
manager while the IPC manages the communication. 
Each process is assigned dynamically to a proces- 
sor in the array and is allowed to block as well 
as run to completion without swapping. A 
"process cache" is maintained to re-use processes 
already in the PA without having to reload them. 

Communication takes place between these
modules in two ways, messages and data. As
previously mentioned, requests for system activity
take place in the form of messages on the Service
bus. Similarly, communication needs between user
processes take place as messages, but on the IPC
bus. Messages are short transmissions, typically
of fixed length, occurring quite often. Data
communication occurs less often but involves much
longer transmissions. For example, this might be
the loading of a user program to a processor in
the PA from the file system. While this form
occurs less often, we would not want it to dominate
the Service bus for the considerable amount of
time it would require. Hence, this creates the
need for the separate Data bus. Although these
connections are described as buses, we do not rule
out the presence of dedicated links for high
density traffic.
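
The choice of communication path described above can be summarized in a short sketch. The class, the function and the processor names "PA0" and "PA1" are hypothetical; only the module abbreviations and the three bus names come from the text.

```python
from dataclasses import dataclass

# system modules named in the text; anything else is treated as a user process
SYSTEM_MODULES = {"UI", "TS", "FS", "DC", "ECC", "IPC"}

@dataclass
class Transfer:
    sender: str
    receiver: str
    bulk: bool = False   # True for long data transfers such as program loading

def select_bus(t: Transfer) -> str:
    """Pick the path for a transfer (a sketch of the rules stated in the text)."""
    if t.bulk:
        return "Data bus"        # long transmissions stay off the Service bus
    if t.sender in SYSTEM_MODULES or t.receiver in SYSTEM_MODULES:
        return "Service bus"     # short messages involving a system process
    return "IPC bus"             # short messages between user processes

print(select_bus(Transfer("UI", "TS")))
print(select_bus(Transfer("PA0", "PA1")))
print(select_bus(Transfer("FS", "PA0", bulk=True)))
```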


ADVANTAGES 


The decreasing cost of processor and memory 
resources has made possible experimentation in the 
distribution of operating system functions. We 
believe that division of the operating system into 
distinct physical subsystems offers many advan- 
tages in terms of simplicity, efficiency, protec- 
tion, and security. Of course, we also expect an 
improvement in performance due to the concurrency 
in such a system. 

From the perspective of the working environ- 
ment implied by this architecture, there are 
several obvious advantages. The isolation of the 
user from centralized control increases both re- 
sponsiveness and security. The paradigm of a 
local network of autonomous stations (with commu- 
nications capability) more accurately reflects 
the work habits of most computer users than does 
that of a centralized time-sharing system. 

Also, the system has many advantages
stemming from its internal organization. The
operating system and user processes are all totally
distributed, allowing significantly faster
response (resulting from the parallelism), and
additional security (resulting from the autonomous
nature of the individual processors).

Many traditional problems of operating
system design disappear in this architecture. There
is no need for memory management, CPU scheduling
(in the timeslice sense), or inter-user security.
The operating system itself is also inherently
secure from user intrusions. The modularization
realized by the separation of the components
simplifies the programming of the individual
functions. Evidence of this is seen in many examples
of multi-user time sharing systems where
multi-layered tables are needed to implement sharing
and allow processes to be swapped out.

The message-based nature of the system 
yields a measure of protection. Furthermore it 
is a simpler problem because it is defensive in 
nature and, since each computing module has only 
a nucleus and one process running, there is 
little need for extensive hardware protection 
mechanisms. 


CONCLUSION 


Implicit in our design is the assumption of 
the availability of inexpensive resources. The 
number of processing elements may seem over- 
whelming at first glance, but advances in VLSI 
technology will allow fabrication of systems in 
a fraction of the space of current centralized 
computing systems in the very near future. 
Current technology allows line widths of 1 micro- 
meter. Use of X-rays will make line widths of 
.1 micrometers possible (5). These improvements 
point to densities that will allow several times 
the number of devices per chip than are now 
possible. The modules of our proposed system 
might well be put on just a few chips, and this 
will reduce cost and improve reliability. In 
particular, the processor array appears to be 
appropriate for high density fabrication tech- 
niques. In the next few years, micro-computers 
as complex as the PDP-11/34, complete with 
processor, memory, and I/O interfaces will be 
available on a single chip (6). The recently 
announced multimicroprocessor chip, Texas 
Instruments' RIC (7) merely reinforces the prac- 
ticality of this view for the future. 

One of the major goals of this work is to 
provide an environment which supports execution 
of true concurrent algorithms in the system's 
Processing Array. Furthermore, it is important 
that this capability be designed into the system 
from the beginning. The operating system should 
provide a set of tools which aid in the speci-
fication of parallel programs and it is our
intent to do so.

Currently, work is in progress to model this 
system. Weaknesses can be observed, in this 
manner, to aid in the final specification of the 
design. Of particular interest is the usage pattern
of the communication paths. We hope to begin 
the development of a prototype system within the 
next year. Only through the construction of such 
a system can we witness the benefits in terms 
of complexity of system design, performance and 
protection. 


(1) R.M. Metcalfe and D.R. Boggs, "Ethernet:
    Distributed Packet Switching for Local
    Computer Networks", Comm. of the ACM,
    (July, 1976), pp. 395-404.

(2) J.K. Ousterhout, D.A. Scelza, and P.S.
    Sindhu, "Medusa: An Experiment in Distributed
    Operating System Structure", Comm. of
    the ACM, (Feb., 1980), pp. 92-105.

(3) P.B. Hansen, "The Nucleus of a Multiprogramming
    System", Comm. of the ACM, (April,
    1970), pp. 238-250.

(4) J. Hoppe, "A Simple Nucleus Written in
    Modula-2: A Case Study", Software-Practice
    and Experience, vol. 10, 1980, pp. 697-706.

(5) B. Fay, et al., "X-Ray Replication of Masks
    Using the Synchrotron Radiation Produced by
    the ACO Storage Ring", App. Phys. Lett.,
    (September, 1976), pp. 370-372.

(6) L. Wittie, et al., "MICROS, A Distributed
    Operating System for MICRONET, A Reconfigurable
    Network Computer", IEEE Trans. on
    Computers, (December, 1980), pp. 1133-1144.

(7) R. Budzinski, J. Linn, and S. Thatte,
    "A Restructurable Integrated Circuit for
    Implementing Programmable Digital Systems",
    Computer, (March, 1982), pp. 43-54.


[Figure 1; labelled elements include the Data Bus, the
Service Bus and the Local Network]

Figure 1 - Single-user distributed OS:
Subsystems and Interconnections


A VARIED STRATEGY PROGRAMMABLE ARBITER 


M. COURVOISIER 
L.A.A.S.-C.N.R.S. 
Université Paul Sabatier 


7, 


ABSTRACT 


The use of arbiters can be very efficient in
shared bus multimicroprocessor structures. As these
structures become more and more complex the use of
arbiters having very sophisticated arbitration
rules is needed. Presently most of the arbiters which
have been studied are based on two simple
arbitration rules : linear or circular (one only is
cyclical and allows mixed priority schemes [5]) and no
systematic design rules exist.


In this paper we present a contribution to the 
systematic design of arbiters having complex arbi- 
tration strategies based on three rules : linear, 
circular and cyclical. 


The basic structure corresponds to a modular 
synchronous arbiter and the problem consists in de- 
signing the decision part as a state machine whose 
construction is obtained by three successive rules. 


INTRODUCTION 


Local parallel shared bus structures require 
fast access procedures to the bus. Among the diffe- 
rent possible techniques [1] selection techniques 
are the most efficient in this case. As distributed 
structures become more and more complicated, the de- 
finition of effective priority rules of access is 
needed. This can be obtained by using centralized 
arbiters able to implement varied arbitration rules. 


At present some authors have proposed different
structures of arbiters [2,3,4,5,6] among which
synchronous ones are well suited to shared bus
multimicroprocessor systems. Nevertheless the
arbitration schemes are for the most part very simple :
linear or circular, and no systematic construction
rule is given.


The aim of this paper is to define systematic 
construction rules for the decision block of a 
synchronous arbiter previously proposed [6]. Three
operators correspond to elementary allocation
strategies : linear, circular and cyclical. Combining
these operators can lead to very sophisticated
arbitration decisions. The rules proposed lead to
the progressive construction of the decision block,
whatever its complexity.


BACKGROUND 


The signaling convention uses the request- 
grant mode. 


The structure of the arbiter proposed in [6,7] 
is made up of five blocks (Figure 1) : input, detec- 
tion and end of requests, decision, sequencing, 
output. 


The meaning of the signals is as follows :
{Ri} : request lines, {Gj} : grant lines, §,: detection
of requests, ©: detection of an end of request,
LIR : load input register, DEC : decision, LOR : load
output register, COR : clear output register, CP :
clock pulse.


The transitions of the arbiter are controlled 
by the sequencing block whose state diagram is given 
in Figure 2. 


DEFINITION OF ARBITRATION RULES 


After having loaded the input requests the
arbiter must select one of them according to the
arbitration rule chosen. The arbitration rule is programmed
in the decision block of the arbiter as a state
machine.


In this paper, we propose to use and combine
three arbitration rules :

L  linear (1L2L...LN) represents a strict priority
   between users 1, 2, ..., N decreasing from 1 to N.

R  round robin or circular (1R2R...RN) represents a
   fair allocation strategy. If user K has been
   served, user K+1 has priority on all other users.

C  cyclical (1C2C...CN) also represents a fair
   allocation strategy in which a user is served
   according to all the previous services granted by the
   arbiter. For instance consider 1C2C3 ; suppose
   that user 2 has been served, followed by user 1,
   and that users 2 and 3 are simultaneously
   requesting. A circular strategy serves user 2 whereas
   a cyclical strategy serves user 3.
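
The three elementary rules can be sketched as selection functions. This is a behavioural illustration only, not the state machine of the decision block. In particular, the cyclical rule is interpreted here as "serve the requester served least recently", which is an assumption consistent with the 1C2C3 example above; all function names are hypothetical.

```python
def linear(requests, order):
    """Strict priority: the first requesting user in `order` wins."""
    return next(u for u in order if u in requests)

def circular(requests, order, last_served):
    """Round robin: scan the ring starting just after the last user served."""
    start = order.index(last_served) + 1
    ring = order[start:] + order[:start]
    return next(u for u in ring if u in requests)

def cyclical(requests, order, history):
    """Serve the requester served least recently, given the whole service history."""
    recency = list(order)
    for u in history:            # each served user moves to the back of the list
        recency.remove(u)
        recency.append(u)
    return next(u for u in recency if u in requests)

users = [1, 2, 3]
# user 2 served, then user 1; users 2 and 3 now request simultaneously
print(circular({2, 3}, users, last_served=1))
print(cyclical({2, 3}, users, history=[2, 1]))
```

The circular rule needs only the last user served, whereas the cyclical rule needs an ordering of all users, which is why the latter requires many more states.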


Figure 3 is an example of the decision block for 
a circular strategy and a 4-user arbiter. 


The combination of these rules is performed by 
using brackets. 
Example : ((1L2L3)R(4R5)) is an arbiter which gives
a circular priority between the block (1L2L3) and
the block (4R5). In the block (1L2L3) the priority
is linear ; it is circular in the block (4R5).


The construction of a strategy from elementary
ones allows very sophisticated arbiters to be defined
according to the requirements of the
multimicroprocessor structures in which they must be used.


The problem consists in designing the state
machine of the decision block and this task can be
very tedious if the strategy is complex. For
instance, as will be shown in the next part, the
decision block of an arbiter with strategy
(((1R2)R(3R4))R(5R6)) is a 32 states machine with 160
labelled arcs. Our contribution consists in defining
rules for the systematic construction of decision
blocks using any block based combinations of L, R
and C strategies.


DEFINITION OF CONSTRUCTION RULES 


The systematic construction of the state machine 
of the decision block of an arbiter is carried out 
in three steps : 

1. Determination of the states 
2. Determination of the arcs 
3. Labelling of the arcs 




(Length limitations of this paper imply that cons-
truction rules are given without proofs).


Determination of the states 

Let us consider the three basic cases :
(1L2...LN), (1R2...RN) and (1C2...CN), which are
fully linear, circular and cyclical strategies
respectively, and let us call them L-block, R-block and
C-block. The L and R blocks are representable as n
states machines, each state i being associated with
user i, whereas the C block is represented as an n!
states machine because (n-1)! states must be
associated with each user i in order to keep track of
the past services (represented by the permutations
on n-1 users). Consider now the following arbiters
in which U is also a user and S can be any of the
L, R or C strategies.

((U)S(1L2...LN))    ((U)S(1R2...RN))    ((U)S(1C2...CN))

In the first case, when U has been served, the
next user to serve is the highest priority user
requesting in the second block regardless of the past.
In the second case the next user to serve in the
second block must be determined according to the
position of the last user served on the priority ring.
In the third case all the past of the second block
(all the possible permutations between users) must
be memorized. Consequently in the first case one
state is sufficient for (U) whereas n and n! are
necessary in the second and third cases,
respectively. This leads to the following definition :

Definition : The multiplicity of a block $B_i$ is a
number $M_i$ which gives the number of times states of
other blocks $B_j$ at the same level of the factorized
expression of the strategy must be repeated. This
is to keep memory of the state of block $B_i$ when it
is left.

Example : Let 1,2,...N be n users. 

The multiplicity of (1L2...LN) is 1 
The multiplicity of (1R2...RN) is n 
The multiplicity of (1C2...CN) is n! 


A one user block may be considered as being of
any L, R or C type because its multiplicity is always 1.


In case of embedded blocks the calculation of
the number of states of the upper block, which is in
fact the complete decision block itself, must be made
according to the subblocks which constitute it.

Then two parameters are necessary to characterize a
block : - the number of its states NS
        - its multiplicity M.


It can be shown that these parameters are
obtained by recursive formulas, which at a given level
$\ell$ of the factorized expression are :

$NS^{(\ell)} = \sum_{b=1}^{B} NS_b^{(\ell-1)} \prod_{i=1, i \neq b}^{B} M_i^{(\ell-1)}$
    if the level $\ell$ operator is L or R

or $NS^{(\ell)} = (B-1)! \sum_{b=1}^{B} NS_b^{(\ell-1)} \prod_{i=1, i \neq b}^{B} M_i^{(\ell-1)}$
    if the level $\ell$ operator is C

$M^{(\ell)} = \prod_{i=1}^{B} M_i^{(\ell-1)}$
    if the level $\ell$ operator is L

or $M^{(\ell)} = B \prod_{i=1}^{B} M_i^{(\ell-1)}$
    if the level $\ell$ operator is R

or $M^{(\ell)} = B! \prod_{i=1}^{B} M_i^{(\ell-1)}$
    if the level $\ell$ operator is C

with B : number of subblocks of the considered level.


The application of these formulas starts from
the level 1 blocks. Once the upper block is reached,
the number of states which is necessary for each
user is known.

Examples : * (((1R2)R(3R4))R(5R6))

             NS=32   M=32
             users 1,2,3 and 4 need 4 states each
             users 5 and 6 need 8 states each

           * ((1C2C3)L4)

             NS=12   M=6
             users 1,2 and 3 need 2 states each
             user 4 needs 6 states

REMARK :
- the number of states of an n user fully linear
block is n,
- the maximal number of states of an n user fully
circular block is 2^(n-1),
- the maximal number of states of an n user fully
cyclical block is n!
A fully L (or R (or C)) block is a block made up of
L (or R (or C)) subblocks only.
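
The recursive formulas can be checked mechanically against the examples above. The following Python sketch is not part of the original method description: the nested-tuple encoding of a strategy, the function name `block_params`, the convention NS = 1, M = 1 for a single user, and the per-user bookkeeping are all assumptions made for illustration.

```python
from math import factorial, prod

def block_params(block):
    """Return (NS, M, states_per_user) for a strategy block.

    A block is either a user name (any non-tuple value) or a tuple
    (op, [subblocks]) with op in {"L", "R", "C"} (assumed encoding).
    """
    if not isinstance(block, tuple):
        # Assumption: a one user block has one state and multiplicity 1.
        return 1, 1, {block: 1}
    op, subs = block
    parts = [block_params(s) for s in subs]
    B = len(parts)
    mult_all = prod(m for _, m, _ in parts)
    # The C operator carries the extra (B-1)! factor in the NS formula.
    scale = factorial(B - 1) if op == "C" else 1
    ns = 0
    users = {}
    for ns_b, m_b, users_b in parts:
        others = mult_all // m_b  # product of the other subblocks' multiplicities
        ns += scale * ns_b * others
        for user, count in users_b.items():
            users[user] = scale * count * others
    m = {"L": 1, "R": B, "C": factorial(B)}[op] * mult_all
    return ns, m, users

# (((1R2)R(3R4))R(5R6)) and ((1C2C3)L4) from the examples above
EX1 = ("R", [("R", [("R", [1, 2]), ("R", [3, 4])]), ("R", [5, 6])])
EX2 = ("L", [("C", [1, 2, 3]), 4])
```

Under these assumptions `block_params(EX1)` and `block_params(EX2)` reproduce the NS, M and per-user state counts quoted in the two examples, which is the main evidence that the reconstruction of the formulas is consistent.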


Determination of the arcs 

Each state assigned to one user has (n-1) out- 
going arcs because any other user (n users) may be 
served after it. Consequently, the total number of 
arcs is NS × (n-1). The determination of the destination
of these arcs can be made systematic if an
indexing of the states pointing out the origin of 
their multiplicity is given. 
Indexing states 

The following rule is applicable.

Rule 1. Indexing of states is applicable from the 
lowest block level. Each state is indexed relatively 
to the blocks which cause its multiplicity to increa- 
se, by the names of the users the memory of which 
must be kept. 
Example : (((1R2) R(3R4)) R(5L6) ) 
gives : 1 2 3 4 


8 states are necessary for user 5 (the multiplicity
of ((1R2)R(3R4)) is 8) because after 5 has been served
it must be possible to decide which one of
users 1,2,3 and 4 must be served according to the
priority rules. 5_13 is an abbreviated form of 5
preceded by 1 preceded by 3.


which comes out of i_α to one of the states associated
with user j. Determining which one of the j states
is reached by i_α needs the following consideration :
the past of the system which was kept in state i_α
must also be kept in state j_β in order to agree with
the arbitration strategy rule. Consequently the rule
below holds :

Rule 2. An arc connects i_α to j_β if β contains
subscripts belonging to {{α, i} - j} in the same order
as i_α if the order is significant.

Example : ((1C2)C3C4) Each state has 3 outgoing arcs. 
The outgoing arcs of state 13 are directed to 23 
Labelling the arcs
This is the last step in the construction of the 
state machine of the decision block of an arbiter. 
Apply the following procedure : 
Step 1. Each outgoing arc of a state receives as la- 
bel the name of the user corresponding to the state 
it is directed to. 
Step 2. The labels of the arcs originating from the
same state must be made exclusive according to the
strategy of the arbiter. Let i_α be this state and ℓ
its block level. Apply iteratively from the higher
block level kmax to level ℓ :
2.1. Let k be the current level and u the position
of the block which contains i_α in that level. Consider
an arc whose destination state belongs to a
block B_j^(k) (B_j^(k) ≠ B_u^(k)). Let b^(k) be the
number of blocks at level k, so that level k has the form

( B_1^(k) op B_2^(k) op ... op B_b^(k) ), op being L, R or C.

a). The level k operator is L
Arcs are labelled by the product of complemented
users' names at the left of B_j^(k). Exclusion between
these labels is performed according to the priorities
of the corresponding users in the block B_j^(k)
with respect to its operator type (L, R or C). If
the operator is L the priority is from left to
right ; if the operator is R the priority is given
to users whose name is not in α according to their
order in the block ; if the operator is C the priority
is determined according to the order of the
users of the block B_j^(k) in the subscript α.


b). The level k operator is R. Two cases have to be
distinguished :
* j > u. The arcs are labelled by
the product of complemented users' names which are
in blocks B_(u+1)^(k) to B_(j-1)^(k). Exclusion between
them is performed like in a.

* j < u. The arcs are labelled by
the product of complemented users' names which are
in blocks B_1^(k) to B_(j-1)^(k) and B_(u+1)^(k) to
B_b^(k). Exclusion between them is performed like in a.


c). The level k operator is C. Each arc is labelled
by the product of complemented users' names

* which are not in α and not in B_j^(k) ;

* which are in α at the right of the name of the
user to which the arc ends in B_j^(k).

2.2. Let k be the current level. Consider arcs belonging
to B_u^(k).

a). The level k operator is L. These arcs are labelled
by the product of complemented users' names at
the left of B_u^(k).

b). The level k operator is R or C. These arcs are
labelled by the product of complemented users' names
which are not in B_u^(k).



Example : (1C(2R(3C4) )) NS = 8; M= 8 
[Indexed states and outgoing arcs (24 arcs) for this example]
stands for the kth level block


Labels : [arc labels obtained at step 1 and at step 2.1.c for this example]


...etc (Figure 4)
IMPLEMENTATION 


The structure of the arbiter is modular and 4
modules are independent (for a given number n of
users) of the strategy employed. The implementation
of a given strategy can be obtained by programming
the decision module constructed as shown in Figure
5. PLA1 and PLA2 contain the equations of the
excitation variables and of the output variables
respectively, corresponding to the state machine of
the decision block designed by the above procedure.
By using a state variable register of length
log2(n!) bits any n user strategy can be implemented
by programming the two PLAs only.
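
As a quick illustration of the register length quoted above, the sketch below computes the number of bits needed to encode n! distinct states (the worst case, reached by a fully cyclical block). Rounding log2(n!) up to a whole number of bits and the function name `state_register_bits` are assumptions of this sketch.

```python
from math import factorial

def state_register_bits(n):
    """Smallest number of bits able to encode n! distinct states."""
    # For k >= 1, ceil(log2(k)) equals (k - 1).bit_length().
    return (factorial(n) - 1).bit_length()

for n in (2, 4, 8):
    print(n, "users:", state_register_bits(n), "bits")
```

For 4 users this gives 5 bits (24 states) and for 8 users 16 bits (40320 states), which shows how quickly the general-purpose register grows with n.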


By using off-the-shelf TTL-S circuits and 82S100
FPLAs for the implementation a response time of
150 ns is obtainable (CP=50 ns).


In this paper, a method for the systematic im- 
plementation of complex strategy arbiters has been 
presented. It is based on the use of a modular syn- 
chronous arbiter. Further studies include the defi- 
nition of multistrategy arbiters and cascadable 
arbiters based on the same principle. 
Abstract


In the context of a multiuser multiprocessor 
system with private cache, we consider the write 
through versus the write back policy of main 
memory update. The write back policy has the 
advantage that the bus traffic is reduced compared 
to the write through policy. It is usually assumed 
that the coherence problems of write back require 
hardware such as global directories to detect 
potential coherence problems. For this reason a 
write through cache is usually used which provides 
coherence for all transactions. 


In this paper we suggest ways to avoid coher- 
ence problems altogether in user code, and examine 
the potential savings due to being able to use a 
write back rather than a write through cache, in 
terms of bus traffic. Using a detailed instruc- 
tion level simulation it was found that in the 
typical case the write back policy will allow 
greater than double the number of processors on 
the bus at a given traffic level, compared to 
write through. 


I. Introduction 


The shared bus approach to multiprocessing is 
very attractive since it is simple to implement 
and easy to use in a multiuser timesharing 
environment. Standard busses such as the Multibus, 
Versabus, and S-100 bus all have provision for 
multiple processors on the bus [1,2]. Larger com- 
puter systems such as the VAX 11-780 have been 
converted for shared bus multiprocessor operation 
[9]. Modern operating systems which are process 
based (VMS, UNIX, AOS) are particularly well 
suited to such an environment [3,7]. 


The obvious disadvantage of the shared bus 
approach is that the bus (and memory), being the 
only shared resource, is a bottleneck. In [9] a 
dual processor VAX is described and it is reported 
that bus saturation occurs somewhere between 2 and 
3 processors. Providing multiple paths to memory 
which can be switched to allow concurrent access 
by multiple processors eases this problem, and a 
great deal has been written on this approach 
[5,6]. In the case of relatively few processors, 
however, it is convenient to avoid the complexity 
of cross bar or delta switches and attempt to con- 
nect the processors on a single bus. One can then 
consider connecting this substructure to others 
through various switching networks, as in Cm* [11].
In a shared bus system the number of processors
which can be supported depends on the bus
bandwidth available, hence it is important to con- 
sider ways of reducing the traffic on the (single) 
memory bus. One method is to use a private cache 
for each processor. The effectiveness of cache 
memories in improving performance of computer sys- 
tems is well known [12,13]. The most obvious 
advantage of the cache is the reduced access time
for cache relative to that of main memory. The 
use of a private cache can also reduce traffic on 
the memory bus. While this is of secondary 
interest in uniprocessor systems, it is of criti- 
cal importance in bus coupled shared memory mul- 
tiprocessor systems. 


In this paper we consider methods of cache 
organization which offer reduced bus traffic com- 
pared to the methods commonly used in existing 
uniprocessor designs. We consider a general pur- 
pose time sharing system for which a detailed bus 
transaction level simulation has been constructed. 
Under realistic assumptions, we find that the bus 
traffic can be reduced in the typical case by a 
factor of greater than 2, and for some systems by 
a factor of greater than 8, by employing these 
techniques. This surprising result directly 
translates to having more than twice as many pro- 
cessors in the system at a given level of bus 
saturation. These techniques deal mainly with the 
update policy of main memory. This improvement is 
made with no degradation of common or desirable 
operating system functionality. In particular, 
neither interprocess communication nor symmetric 
multiprocessing are precluded. 


II. Cache Coherence and Potential Gain due to
Writeback


The two major categories of cache organization,
private and shared, are shown in figures 1
and 2 respectively. Examples of these structures
(in the uniprocessor case) are commercially available.
The VAX 11-780 from DEC uses a private cache
while the Data General MV/8000 uses a shared cache
[4,11]. The ATU (address translation unit) is
shown between the CPU and the cache, indicating
that the virtual addresses which are issued by the
CPU are translated into physical addresses which
index the cache. There are advantages to placing
the ATU between the cache and the system bus (the
cache is then indexed by virtual addresses) and
this organization is under study now. For the
remainder of this paper we will assume the usual
case of translating the addresses before indexing
the cache, as illustrated in Figures 1 and 2.


As in any hierarchical memory system the 
question of coherence among multiple copies of 
logically identical data items (e.g. a cached item 
and its copy in memory) must be resolved [10]. 
The shared cache in figure 2 has no coherence
problem, since there is no device that modifies
the memory without going through the cache. This 
allows the use of a write back rather than a write 
through policy for main memory update. This is the 
organization used in the MV/8000. 


In the private cache structure of figure 1 
there is potential for cache coherence problems 
even in the uniprocessor case, since DMA I/O can 
modify cached data. In the VAX 11-780, which uses 
this structure (with a single CPU) the cache moni- 
tors the bus for writes to locations that it has 
cached. When it detects one it marks the 
corresponding cache slot empty so that the next 
access will be forced to read the modified value 
from memory. Writes to cache can be immediately 
passed on to main memory, and the memory system is 
able to queue write requests, so that the proces- 
sor can continue without waiting for the write to 
complete. Read access to the cache, of course, 
requires no transaction on the system bus, hence 
the private cache saves bus traffic over the 
shared bus cache. 


There is a third alternative, namely using a
write back policy in a private cache. Since cached
values are written only on a cache fault that 
requires a replacement into memory, or at context 
switch time, cached values may be modified in 
cache more times than they are written to memory. 
This represents a potential savings in bus traffic 
over the write through case. In addition, the 
complexity of queued writes to main memory can be 
avoided. The disadvantage is that a "modified" 
bit must be maintained in the cache and at context 
switch time any modified words (or blocks) must be 
written out. This causes bus traffic to be related 
to context switch rate. The main problem with 
write back is the coherence problem, which we will 
consider now. 
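
The difference between the two update policies can be illustrated with a toy model. The sketch below is not the simulator used in this study: it assumes a direct-mapped cache with one word per slot, counts one bus transaction per memory read or write, and empties the cache at each context switch. The class `ToyCache` and the helper `count_bus` are hypothetical names introduced only for this illustration.

```python
class ToyCache:
    """Direct-mapped, one word per slot; only bus transactions are counted."""

    def __init__(self, n_slots, write_back):
        self.n_slots = n_slots
        self.write_back = write_back
        self.slots = {}   # slot index -> [address, modified bit]
        self.bus = 0      # bus transactions issued so far

    def _install(self, addr, modified):
        idx = addr % self.n_slots
        old = self.slots.get(idx)
        if old is not None and old[0] != addr and old[1]:
            self.bus += 1          # replacement of a modified word: write it back
        self.slots[idx] = [addr, modified]

    def read(self, addr):
        old = self.slots.get(addr % self.n_slots)
        if old is None or old[0] != addr:
            self.bus += 1          # read miss: fetch the word from main memory
            self._install(addr, False)

    def write(self, addr):
        if self.write_back:
            self._install(addr, True)   # only the modified bit is set
        else:
            self.bus += 1               # write through: every write uses the bus
            self._install(addr, False)

    def context_switch(self):
        # Write out every modified word, then leave the cache empty.
        self.bus += sum(1 for _, modified in self.slots.values() if modified)
        self.slots.clear()


def count_bus(write_back, passes=100, n_words=4):
    """Bus transactions for a loop that rewrites n_words words, then switches."""
    cache = ToyCache(n_slots=8, write_back=write_back)
    for _ in range(passes):
        for addr in range(n_words):
            cache.write(addr)
    cache.context_switch()
    return cache.bus
```

With the default arguments `count_bus(False)` issues 400 transactions and `count_bus(True)` issues 4. These figures are artifacts of the toy loop, not measurements; they only show why words modified many times between context switches favor write back, and why the context switch rate enters the write back traffic.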


A user process as modeled in figure 3, exe- 
cuting in a timesharing environment, will typi- 
cally do all of its I/O via system calls and in 
the usual case will be doing blocking I/O. Interprocess
communication will also be done via system
calls (as opposed to directly writing shared 
memory). It seems then that user code need not 
worry about coherence, so that any write through 
operation from a user process represents an
unneeded bus transaction. This is the motivation 
for considering how much traffic is used for write 
through, and whether it can be avoided. 


In a process based operating system, which is 
typical of what is run on the systems considered 
here, a process can be blocked, ready, or running. 
We will assume that a ready process can execute on 
any of the processors in the system, and that when 
a processor is ready to run a process, that pro- 
cess is taken from a central queue in an atomic 
operation which is not susceptible to races among 
processors. This is not a difficult objective to 
achieve, and it allows the operating system to be 
largely independent of the number of processors 
that are connected to the bus. Figure 3 illus- 
trates the major states that a process can be in, 
and some conditions under which transitions occur. 
This simple model is not at all unrealistic for 
consideration of the execution phase of a process. 
We will ignore process initiation and termination,
since these are boundary conditions during which
operating system control of cache can be assumed.

Hence, if the following three rules are 
observed, we can at least ignore the coherence 
problem for nonsystem code. 

(1) The process does no I/O itself. This does
not restrict the operating system from initiating
DMA I/O into the process address
space.

(2) When a process is in the blocked or ready
state, there are no values from the process
address space in the cache.

(3) When the process communicates with another
process, it does so via a system call, as
opposed to (for example) writing into physical
memory that the receiving process is
expecting to use for communication.


While it is beyond the scope of this paper to
treat operating system implementations relative to
cache policy, we have considered the problem.
Suffice it to say that these rules do not preclude
services such as nonblocking I/O, multiple event
wait, and interprocess communication, which we
feel are essential in any multiprocessing system.
Having confined the coherence problem to the
operating system we appeal to the fact that the
system can be aware of when coherence problems can
arise. In the case of interprocess communication
it is possible to implement message passing by
mapping a block of memory into the receiver's
address space. Since the pages were previously
unmapped (not in the receiver's address space),
they certainly are not in cache, so there is no
coherence problem. A less elegant alternative
which has been used in DEC-10 dual processor systems
under SMP (symmetric multiprocessing) is for
senders to cause a cache flush in the receiver's
cache. There are more intricate hardware solutions
in the literature as well [10].


In the later sections we will consider the
amount of bus traffic that can be saved by using a 
write back cache in various system configurations. 
In what follows we assume that the problem of 
coherence is dealt with as suggested above. We 
now discuss the simulation system used. 


III. Simulation System 


There are many ways of analyzing a complex 
system such as the one in Figure 1. These range
from the stochastic approach of characterizing a 
system in terms of a small number of statistical 
parameters to the empirical investigation of a 
realization of the system. We have chosen to 
simulate the system at a fairly low level; i.e. 
instruction timing and bus conflict behavior are 
faithfully replicated, but those phases of opera- 
tion which are not directly of interest relative 
to cache performance, such as instruction decode 
details, are not included. The simulation will 
accurately reflect, for example, alternating bus 
access by the processors. The system is driven by 
execution of target system code. 


The amount of concurrency inherent in this
system precludes the exclusive use of a standard
sequential programming language. We use the C
programming language to express the sequential
parts of the target system. To extend the
language for the simulation of highly concurrent
systems, we have constructed a simulation environment
which provides for process creation, termination,
synchronization, and communication. This
allows a very natural expression of the semantics
of a digital system since typical hardware systems
can be accurately viewed as a collection of
processes. In our example we have only 3
processes. These are the bus process and two
processes to realize the processors. Figure 4
illustrates the structure of the system. The kernel
portion is written in assembly language since
it needs to be able to maintain multiple data segments
for the processes. The utilities are written
in C, as are the simulation modules CPU0,
CPU1, and BUS.


Part of the benefit of using the simulation
environment is that code can be shared. For example,
there is only one copy of the code which
implements the CPU element, and there are two
independent processes that execute this code. In
this sense the simulation system is modular. To
increase the number of processors on the bus we
merely invoke a third copy of the CPU process by
changing two lines of code in the simulation.
Changing the design of the target in this fashion
is quite simple and this allows the designer to
evaluate several different system configurations
in a matter of a few hours.


The most important functional capability pro- 
vided by the simulation system is the ability to 
manage several processes in a single address 
space. Management includes the following. 


Process creation and termination 
Processes can be created and terminated 
dynamically. 


Priority scheduling of processes 
Processes can be initiated at any of 4 
priority levels, with all processes at a 
given priority executing in a round robin 
fashion.


Maintenance of a sleep queue 

Processes typically indicate that they are 
going to incur a time delay by calling 
sleep. A memory module would for example 
issue the call "sleep(450)" to indicate that 
a memory access requires 450 ns. Other 
processes will continue executing during 
this 450 ns period if possible. 


Signal, wait, and semaphores 
These are provided for interprocess communi- 
cation and synchronization. 

The simulation system implements the target as a
collection of processes that run in the single
address space of a UNIX process. This feature is
critical to good performance. We incur a penalty
of only about 30 instructions for signal and wait,
since there is no need to call the operating system
to communicate with other processes, in contrast
to other simulation systems [14].
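
The sleep queue idea can be sketched in a few lines. The example below is only an illustration in Python, not the C and assembly kernel described above: simulated processes are generators that yield a delay in nanoseconds (standing in for a call such as sleep(450)), and a heap ordered by wake-up time plays the role of the sleep queue. The names `Sim`, `memory`, `cpu` and `demo` are hypothetical, and priorities, signal, wait and semaphores are left out.

```python
import heapq

class Sim:
    """Minimal single-address-space scheduler with a sleep queue."""

    def __init__(self):
        self.now = 0        # simulated time in ns
        self._queue = []    # heap of (wake time, sequence number, process)
        self._seq = 0       # tie-breaker so processes are never compared

    def spawn(self, process):
        heapq.heappush(self._queue, (self.now, self._seq, process))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, process = heapq.heappop(self._queue)
            try:
                delay = next(process)     # run until the process sleeps again
            except StopIteration:
                continue                  # process terminated
            heapq.heappush(self._queue, (self.now + delay, self._seq, process))
            self._seq += 1

def memory(sim, log, accesses):
    for _ in range(accesses):
        yield 450                         # a memory access requires 450 ns
        log.append(("memory access done", sim.now))

def cpu(sim, log, steps):
    for _ in range(steps):
        yield 100                         # assumed 100 ns per step
        log.append(("cpu step done", sim.now))

def demo():
    sim, log = Sim(), []
    sim.spawn(memory(sim, log, 1))
    sim.spawn(cpu(sim, log, 3))
    sim.run()
    return log
```

In `demo()` the three CPU steps complete at 100, 200 and 300 ns while the memory process is still asleep, and the memory access completes at 450 ns, which is the "other processes continue executing during this period" behavior described in the list above.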
IV.


Within the context of a system such as that
in Figure 1, there are many parameters which can 
be varied without violating the basic structure. 
We consider the following. 


(1) Cache policy: write back vs. write through 

(2) Length of time slice. 

(3) Timing parameters such as bus speed and
behavior with and without cache. 

(4) Cache block size. 


Our main result deals with the amount of bus 
traffic that can be saved by using a write back 
rather than a write through policy. For the write 
back case, the length of time that a process runs 
without a context switch (and attendant write 
back) is also examined. We refer to this time as 
the timeslice. 


As we have mentioned, the simulator used here 
is a low level deterministic simulator. To evalu- 
ate a given design parameter, a test program is 
run on the (simulated) target. In our case a com- 
piler for a simple variant of Pascal was written 
to allow reliable and convenient generation of 
nontrivial target programs. This language was 
chosen for convenience since some support software 
for its execution was already in existence. The 
hypothetical target processor was chosen because 
it is architecturally interesting and straightfor- 
ward to implement. There is nothing inherent in 
our analysis technique that precludes evaluation 
of existing real machines. We have in fact done 
so in evaluating a similar system incorporating 
PDP-11 processors, and an effort to compare the 
current system to a similar structure using the 
Motorola 68000 processor is under way. Since we 
are concerned here mainly with the issue of bus 
traffic and cache effects, the detailed issues of 
the processor architecture are largely irrelevant, 
as long as the processor used is similar in its
address reference behavior to conventional 
machines. Our simulation has been so designed. 


To illustrate the role cache plays in reduc- 
ing bus traffic, a simple program was run on the 
system and bus speeds were varied. The program 
sorts elements in a matrix by calling several sub- 
routines. This program was used because, in con- 
trast to the Gaussian elimination program used 
later, it depends heavily on subroutines. 


Figure 5 is a plot of average bus utilization
versus bus speed for a two processor system with
private cache. The higher curve is for the case
that the cache is turned off completely. Note
that saturation occurs at a much slower bus speed
for the no cache case, indicating that the cache
is effective in keeping the processors off the
bus. It should be noted that the statistic given
(bus utilization) is not a useful measure of per- 
formance since the amount of time a processor 
spends waiting due to conflicts is not indicated. 
It is not difficult to obtain conflict statistics 
from the simulator, but for this study it suffices 
to examine the execution time of the various con- 
figurations (see Figure 6 for example). 


A more interesting example is shown in Fig- 
ures 5 and 6 which show traffic and execution time 
respectively as a function of timeslice. The 
analysis of the cache behavior has been carried 
out on several variations of the following confi- 
guration. 


(1) The cache is two way set associative and
can hold 1Kb. Both code and data are typically
cached, and the block size is 4 bytes.

(2) The cache is private to the processor, one
cache per processor as in Figure 1.

(3) The memory is simplistic in that no requests
are queued; if a word is written on a write
through cycle the processor waits until that
transaction is complete before proceeding.

(4) The bus is relinquished at the end of each
cycle, so that if contention occurs, processors
will alternate bus cycles.



The code in this case is a 30 X 30 Gaussian 
elimination program. The cache size is 256 32 bit 
words. This is intentionally somewhat small com- 
pared to the size of the code plus data for the 
program, which totals about 5600 bytes. The hit 
rate for this program is typically 95%. 
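
For concreteness, the cache geometry implied by this configuration can be worked out as follows. The sketch assumes that 1Kb means 1024 bytes (consistent with 256 32 bit words) and that sets are selected by the conventional block-number-modulo-sets rule, which the text does not specify; `split_address` is a hypothetical helper.

```python
CACHE_BYTES = 1024      # assumed: 1Kb = 1024 bytes = 256 words of 32 bits
BLOCK_BYTES = 4         # block size from item (1)
WAYS = 2                # two way set associative

N_BLOCKS = CACHE_BYTES // BLOCK_BYTES   # 256 blocks
N_SETS = N_BLOCKS // WAYS               # 128 sets of 2 blocks each

def split_address(addr):
    """Split a physical byte address into (tag, set index, byte offset)."""
    offset = addr % BLOCK_BYTES
    block = addr // BLOCK_BYTES
    return block // N_SETS, block % N_SETS, offset

PROGRAM_BYTES = 5600
print(N_BLOCKS, N_SETS, PROGRAM_BYTES / CACHE_BYTES)
```

The last value shows that the code plus data is roughly five and a half times the cache capacity, which is the sense in which the cache is "intentionally somewhat small" for this program.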


The Gaussian elimination program does no I/O
and is inherently free of subroutine calls. To
simulate the effect of operation in a timesharing
environment, a clock tick interrupt occurs at regular
intervals corresponding to a transition from
the running to the ready state of Figure 3. It is 
assumed that on return to the running state the 
cache appears empty and has to be demand loaded. 
While this is not necessarily the best way to 
design a system, it is common practice. 


At context switch time the write back cache 
has to write any modified location into main 
memory. The write through cache has no such 
requirement since written values have already
been updated. Hence we expect that a very high 
context switch rate will cause the write back to 
suffer. As can be seen in Figures 5 and 6 this is 
indeed the case. 


However, even going to the extreme of a con- 
text switch every two thousand instructions, the 
write back strategy is superior by a factor of 
1.64 in terms of bus traffic and by a factor of 
1.62 in execution time, for this program. While 
the times as shown in Figure 5 are accurate relative
to the assumed speed of the components of the
system, they depend on both the speed of the pro- 
cessor relative to the cache and the times for the 
instructions executed. The bus traffic in Figure 
6 is independent of processor speed and we can 
state that, assuming a context switch every eight
thousand instructions we need to support 2.57 
times as many bus transactions if write through is 
used. 


On a large VAX system, the context switch 
rate under load is in the vicinity of sixty con- 
text switches per second [15]. With the timing 
used in our experiments this corresponds to about 
16000 instructions per timeslice, which will give 
an even greater advantage to the write back policy.
Furthermore, increasing the number of processors
for a given multiprogramming load will
decrease the context switch rate, further reducing 
overhead bus traffic. In the limit, assuming 
processes run to completion, the write back policy 
requires fewer bus transactions by a factor of 8. 
If the processor architecture is primarily memory 
to memory, as in the Intel iAPX 432 [17], the need 
for cache to reduce bus traffic is even greater. 
Under these assumptions, and assuming run to com- 
pletion processing we find that the write through 
cache issues more bus transactions than the write 
back cache by a factor of 18 for this problem. 
This illustrates the need for careful cache design 
in such systems. 
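
The relation between context switch rate and timeslice length quoted above is simple arithmetic, sketched below. The function names are hypothetical, and the average instruction time is not stated directly in this section; it is derived here from the two quoted figures (sixty switches per second, about 16000 instructions per timeslice).

```python
def implied_instruction_time_us(switches_per_second, instructions_per_slice):
    """Average instruction time (microseconds) implied by the two figures."""
    return 1e6 / (switches_per_second * instructions_per_slice)

def instructions_per_timeslice(switches_per_second, instruction_time_us):
    """Instructions executed between two context switches."""
    return 1e6 / (switches_per_second * instruction_time_us)

t = implied_instruction_time_us(60, 16000)
print(round(t, 2), "microseconds per instruction")
```

The implied instruction time is a little over one microsecond. At the two thousand and eight thousand instruction timeslices used in the comparison, the same arithmetic corresponds to roughly 480 and 120 context switches per second, well above the quoted VAX rate.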


We have also investigated smaller programs
and different cache organizations. Figure 8 is a
graph of execution time versus bus speed for the
matrix sorting problem. The cache in this case
has a blocksize of 16 bytes, and we assume that
any write back transaction must write a 16 byte
block. Write through, of course, requires only a
single transaction. The average bus utilization
in this case ranges from .36 to .59, even though
the bus speed is 2.2 microseconds, which
corresponds to approximately twice the instruction
time for this processor. When the processor speed
was changed to 30 ns for all instructions, with a
100 ns cache, the bus utilization was still only
.9 at a bus cycle time of 1.2 microseconds. Thus
adding a write back cache effectively more
than doubles (700 ns versus 2000 ns at the .55
saturation level) bus bandwidth as measured by the
average saturation.


In Figure 9 the bus traffic for this program
for the write through and write back cases is
plotted. The two are equal at a point well below
realistic levels of context switch activity. As
in the other cases, the write back policy is superior
for reasonable context switch rates, though
in this case, the improvement is only a factor of
1.3.


Conclusions 


This study has shown that for a shared bus 
multiprocessor organization one can significantly 
increase the number of processors a given bus can 
support by using a write back rather than a write 
through policy for main memory update from cache. 


For small to medium size machines, for which 
the cost of a processor is small compared to the 
cost of the rest of the system, this is an espe- 
cially attractive means of improving multiuser 
performance without adversely affecting the efficiency
or the functionality of the operating system.


There are many more parameters and design 
tradeoffs that we have not considered here. 
Currently we are investigating the benefits of 
having a relatively wide path from the cache to 
main memory, and having the I/O devices communicate
using this bus. If we assume that the I/O
devices are reasonably intelligent it is possible
to move a large part of the file system and
operating system services into the device controllers
(the file and I/O handlers) and into the
bus interface (interprocess communication and
synchronization). Current CPU chips are quite impressive
in that they have reached the level of being
viable for support of useful operating systems.
VLSI techniques applied to considerations such as
cache design and operating system support as
described here will allow the construction of
extensible systems that can be configured for an
extremely wide range of performance while maintaining
component commonality.
Abstract


The coherence problem occurs in a multicache system
when data inconsistency exists between the private
caches and the main memory. Without an effective
solution to the coherence problem, the effectiveness
of a multicache system will be inherently
limited. These problems will be closely examined
and treated in a systematic top-down manner. A
new solution, the LSCS (logical semi-critical section)
scheme, in which the memory reference of a processor
is made as fast as possible, is proposed.


Introduction 
The architectures of a multiprocessor computer 
[8,12] are primarily characterized by three attributes:
(1) multiple, not highly specialized processors
are used, (2) all processors share most,
and often all, of the main memory, and (3) each of 
the processors is able to do computation individu- 
ally. Advantages offered by these attributes are 
so fruitful that these architectures will undoubt- 
edly play an important role in the computer of the 
future. 

The modularity and redundancy inherent in a
multiprocessor computer offer the opportunity to
build a more reliable system. In a carefully
designed computer, a failure of a single module
does not crash the entire system, instead only a
graceful degradation of the system's performance
is anticipated. It is also true that in a multiprocessor
computer the sharing of resources and
processing power tends to smooth out effects due
to random variations in workloads. In return, the
throughput/cost ratio is increased. This occurs
even if each processor in a multiprocessor computer
performs worse than when it is in a uniprocessor
configuration [12].

The economical advantage of a multiprocessor
architecture during the computer design
phase has also been noticed in [15]. The regularity
of a multiprocessor computer allows duplication
of modules of the same type. Both time and
cost of design are thus reduced significantly.
Furthermore, the designer of a multiprocessor computer
may enjoy the freedom of choosing the most
cost-effective uniprocessor element structure,
independent of the processing speed of the element.

Due to physical limitations imposed by the existing
technology, a uniprocessor computer may not
be able to offer enough, or required, processing
power. Thus, in spite of other advantages such as
expandability, modifiability, etc., the construction
of a multiprocessor computer seems to be
necessary. Concurrent execution of a number of
tasks on different processors which aim at a single
computation objective can reduce the overall
computation time to a certain degree depending on
the nature of the computation and the specific
architecture of the computer. To date, all or some
of these advantages have been clearly demonstrated
to a certain extent by a number of experimental
and commercially available computers such as
C.mmp, Cm*, PLURIBUS, S-1 Multiprocessor, IBM
370/168, CDC Cyber 170, Honeywell 60/66, Burroughs
B7700, and Tandem NonStop.

Although multiprocessor computers offer many potential advantages, they
also generate many problems. In particular, many people have long
believed that a multiprocessor computer composed of N processors always
yields much less than N times the performance (throughput) of the
corresponding single processor computer due to the substantial memory
interference and synchronization overhead. Multiprocessor computers
are, on this view, doomed to waste substantial resources, especially
when the number of processors is large, say greater than four.

The experimental data obtained from C.mmp [10] disproved the impression
that synchronization overhead, also termed software lockout, is
intolerably high in a multiprocessor computer. In fact, in the
measurements involving 14 processors, idleness due to locking consumed
less than 1 percent of the processors. By contrast, the cost of memory
interference is indeed high. Roughly a factor of 3 in performance
degradation has been observed in C.mmp [10] if all 16 processors
execute from a common memory. Consequently, the major threat to the
performance of a multiprocessor computer is primarily due to the
contention in the memory.

Numerous studies aiming at reducing the memory interference have been
performed ever since the proposal of multiprocessor computers. They can
be roughly grouped into three categories: (1) solutions that resort to
static or dynamic memory allocation strategies [4,6,9,13,17], (2)
solutions that require data tagged by specially designed operating
systems [5], and (3) solutions that assume dynamic hardware support
independent from software environments [1,6,11,18]. Among them, we
believe that the future general purpose multiprocessor computers will
fall into the third category, especially for high-performance systems.
The architectures with fewer management problems and less special
software assists will eventually dominate.

In a computer of the third category, each processor is associated with
a private cache by which a certain amount of information is trapped and
retained. As we know, in addition to the usually faster memory cycle
time, a cache serves as a lookahead and lookbehind buffer. The
lookahead capability of a cache may actually increase memory
interference unless the additional words brought along with the missing
word into the cache do not introduce extra fetches to the main memory.
Moreover, the cache capacity has to be large enough to insure that the
utility of the lookahead is no less than that of the lookbehind.
Nevertheless, a system with less memory interference does not
necessarily result in a better performance. Good performance is a
result of a balance between the degree of memory interference and the
cache hit ratio. This subject has been studied in [18]. On the other
hand, the information retained in a cache, termed lookbehind
capability, usually has a high probability of being reused several
times before being swapped back into the main memory, so that the
frequency of main memory references is highly reduced. In return, the
memory interference is also reduced.

Unfortunately, such a multicache system, as shown in Figure 1, causes a
coherence problem because multiple copies of a main memory block may
reside in multiple private caches at any given time. In general, a
coherence problem occurs as soon as two or more access paths to a
single data entry exist simultaneously. This problem is vital to the
integrity of the system and is regarded as the major obstacle in the
design of a multicache system. In order to eliminate such a coherence
problem once and for all, an interesting shared-cache system, as shown
in Figure 2, has been proposed and extensively studied by Yeh [16].
Each processor, instead of talking to its private cache, goes through
the interconnection network and then talks to a cache which is shared
by all processors. In principle, the philosophy behind this proposal is
to apply the cache memory technology to the conventional main memory
and omit all privately owned caches. As a result, the original shared
main memory now becomes the secondary memory. The boundary in the
memory hierarchy at which context switching occurs during a page fault,
however, is pushed one level down, to the boundary between the second
and the third level of the memory hierarchy in this case. For such a
shared-cache system, the coherence problem is truly eliminated, but all
the original problems which led us to put the cache into a
multiprocessor computer still exist, such as memory interference and
the transmission delay of the interconnection network.


In the following sections, we first illustrate coherence problems in
detail and then discuss various solutions for them.

2. Coherence Problem 

A coherence problem may occur in a multicache system when data
inconsistency exists in the caches and the main memory. Multiple
copies of a given main memory block may exist in several private
caches. Modification of any copy of this shared block by a processor in
its cache will cause an obsolete value of this shared data in every
other cache. Data inconsistency thus occurs in the caches. To take
specific examples, let us consider a multicache system with N caches,
C_j for j=1,...,N, and a main memory being shared by all processors,
P_i for i=1,...,N. Let X be the physical address of a main memory block
issued by the memory mapping function of a processor. When a copy of
block X resides in C_i, let the corresponding cache address be block
y_i. Thus, "block X" is always used to denote a specific block in the
main memory, and "y_i" implies that a copy of this main memory block
resides in the cache block y of C_i. In the following two examples,
coherence problems arise:
(E1) Data are assumed to be shared among processes. P_i reads block
y_i without noticing that block y_j has been modified by P_j.

(E2) A process is allowed to switch among processors. Process A may be
executed on two processors in a sequence such as ... → P_i → P_j → P_i.
As a result, process A may have a copy of block X in both C_i and C_j.
If process A has modified block y_j, it will then read obsolete data
from y_i after switching back to P_i.


Modification of a copy of block X by a processor in its private cache
will result in data obsoleteness in the main memory if block X is not
updated immediately. As a result, coherence problems may also arise
under the following situations:

(E3) Data are assumed to be shared among processes. A copy of block X
is brought into C_i upon a miss while another copy of block X has been
modified in C_j and this modification has not yet been reflected in
block X.

(E4) A process is allowed to switch among processors. Process A is
running on P_i first and then switched to P_j. After process A has been
switched, the most recently modified data of process A may still be in
C_i. Hence process A running on P_j could read obsolete data from the
main memory upon a miss.
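
The situations in (E1) and (E3) can be reproduced with a few lines of
Python. This is only a minimal sketch under stated assumptions: the
classes `Memory` and `PrivateCache`, the block name "X", and the values
0 and 42 are hypothetical and are not taken from the paper; the caches
are write-back caches with no coherence mechanism at all.

```python
class Memory:
    """Shared main memory: block address -> value (hypothetical model)."""
    def __init__(self):
        self.blocks = {}


class PrivateCache:
    """A private write-back cache with no coherence mechanism."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                      # block address -> cached copy

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from main memory
            self.lines[addr] = self.memory.blocks[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value             # only this cache sees the change

    def swap_back(self, addr):
        self.memory.blocks[addr] = self.lines[addr]


memory = Memory()
memory.blocks["X"] = 0
c_i, c_j, c_k = (PrivateCache(memory) for _ in range(3))

c_i.read("X")                    # C_i holds a copy of block X
c_j.read("X")
c_j.write("X", 42)               # P_j modifies its own copy only

stale_hit = c_i.read("X")        # (E1): hit on an obsolete cached copy
stale_miss = c_k.read("X")       # (E3): miss served from obsolete main memory
c_j.swap_back("X")               # main memory is updated only now
fresh = PrivateCache(memory).read("X")
```

Both `stale_hit` and `stale_miss` still carry the old value, while a
cache that misses after the swap-back sees the new one.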

These examples are expressed from a process' point of view. Viewed
from a processor, in fact, the problems posed in (E2) and (E4) are
exactly the same as the problems posed in (E1) and (E3). Since cache
memory management is carried out in hardware, it is much easier to deal
with processors rather than with processes for a multicache system.
Thus, restricting processes from switching among processors to
eliminate (E2) and (E4) does not actually simplify the problems. On the
other hand, (E3) and (E4) can be eliminated if the main memory update
policy is write-through instead of flag-swap. Nevertheless, without
buffering, the rate of accessing the main memory can not be lower than
the write rate of a processor under a write-through policy. In the next
section, some previous solutions are described. We will then present a
new scheme in which the memory reference of a processor is made as fast
as possible. In particular, this scheme is developed in a systematic
top-down manner.


3. Previous Solutions 

A commonly used coherence scheme in commercial computer systems with a
small number of processors is to connect every cache to a high-speed
bus on which the addresses of the blocks to be modified are sent. Each
cache permanently monitors this bus and invalidates the affected block
in case of a hit. In the mean time, the write-through policy is used to
insure the update of main memory. This scheme has many weaknesses [2].
The invalidation traffic on the bus is often very high since the mean
write rate for most processors is between 10 and 30 percent. The peak
rate is even much higher, and a buffer may be needed for each cache to
queue up the invalidated addresses. Moreover, a different coherence
problem may occur due to these invalidation queues. Finally, the rate
of cycle stealing of the cache directory to perform the search for
those invalidated addresses is so high that only a small proportion of
cache directory cycles is free for normal operations. All of these
explain why this scheme has been limited to systems with no more than
two caches.

The caches in C.mmp [5] implement write-through in the main memory, but
the contents of caches on other processors are not affected. The
coherence problem is resolved by having the operating system designate
which pages are safe to cache via the cacheable bit in the relocation
registers. Thus, all shared writable pages have to be in the main
memory only. In other words, the shared writable data are centrally
managed. The drawbacks of this scheme are the need for a special
operating system and a cache hit ratio that is inadequate for a
high-performance computer. It should be noticed that the special assist
required from the operating system may not be necessary in a
capability-based system with architectural support. In addition, for
specific environments, the resulting cache hit ratio may be adequate
for a low-budget multiprocessor computer as well. However, this
solution obviously is an inherently limited approach.

More recently, three closely related schemes have been developed
independently by Tang [14], Censier and Feautrier [2], and Widdoes
[15]. They treat each block in the main memory as a semi-critical
section [3]. This means that a block X can be shared among several
readers but can only be accessed by one writer. Here, a reader stands
for a cache C_i in which a copy of block X resides and in the mean time
this copy has only been read. A writer represents a cache C_i in which
a copy of block X resides while this copy has been written by processor
P_i. All the readers or the writer of block X are recorded in a
logically centralized map. This map is dynamically updated whenever the
state of any semi-critical section is changed. In other words, such a
map is designed to keep track of all the readers or the writer of each
block X in the main memory. Hence, not only can the irrelevant cache
invalidation requests be filtered out, but the cache where the most
recently modified data of block X reside can also be identified. As a
result, the flag-swap policy is adopted in all three proposals.

This map-based approach certainly requires more hardware, but offers a
much better performance, especially when the multiprocessor computer
contains more than two or four processors. It solves the coherence
problem without knowing the semantics of the content of each main
memory block. Thus, this is a totally transparent approach. However,
this approach is based on the concept of semi-critical section not only
logically but also physically. Consequently, the first time a processor
tries to write into a cache block which was loaded upon a read miss,
the processor can not execute this write until the state of the
corresponding main memory block X is changed, even if its cache owns
the only copy of block X. We thus refer to this approach as the PSCS
(Physical Semi-Critical Section) scheme in contrast to our LSCS
(Logical Semi-Critical Section) scheme presented in the next section. A
digest of the PSCS scheme can be found in [7].


4. The LSCS Scheme 

The purpose of using a cache is to feed the data to a processor as
fast as the processor demands. Thus, the cache is often integrated into
a processor unit and implemented in the same technology as the
processor. Moreover, the management of cache memory is carried out
completely by hardware, which makes decisions locally as much as
possible so that the response time to a processor's demand can be
minimized.

The objective of the proposed LSCS scheme is to reduce the effective
cache access time by making as many local responses as possible. There
are a main memory controller MC and a cache controller CC_i for each
cache C_i. These controllers run asynchronously. Commands are exchanged
between the cache controllers and the main memory controller. A local
response means that a cache controller can permit a processor to access
its cache without interacting first with the main memory controller. To
understand the proposed scheme clearly, let us define the legal state
of block X viewed from MC as X-state ∈ {a,b,c,d}, where

a: no copy, or no valid copy, of block X is in caches,

b: block X is updated and only a single copy of block X is in caches,

c: block X is updated and multiple copies of block X are in caches,

d: block X is obsolete and only a single copy of block X is in caches.


Let the legal state of a copy of block X in C_i viewed from CC_i be
y_i-state ∈ {α,β,γ,δ}, where

α: a copy of block X may be in C_i, however, it is invalidated.

β: an intact copy of block X is in C_i and it is the only copy of
   block X in caches.

γ: an intact copy of block X is in C_i, however, there are one or more
   copies of block X in other caches.

δ: a modified copy of block X is in C_i and it is the only copy of
   block X in caches.


These states are illustrated in Figure 3. NP(X) represents the number
of copies of block X which reside in caches. Obvious correspondences
between b and β, c and γ, and d and δ respectively can be recognized.
Any change of y_i-state must be reflected in the X-state.
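
The two state sets and their correspondence can be written down as a
small Python sketch. It is illustrative only: the names `XState`,
`YState`, `CORRESPONDS` and `x_state` are hypothetical, and the four
cache-side states are spelled out as alpha, beta, gamma and delta, which
is an assumed reading of the symbols used in the text.

```python
from enum import Enum


class XState(Enum):          # state of block X as seen by MC
    A = "a"   # no copy, or no valid copy, in any cache
    B = "b"   # block X up to date, exactly one cached copy
    C = "c"   # block X up to date, several cached copies
    D = "d"   # block X obsolete, exactly one (modified) cached copy


class YState(Enum):          # state of one cached copy as seen by CC_i
    ALPHA = "alpha"   # invalidated copy
    BETA = "beta"     # intact, the only cached copy
    GAMMA = "gamma"   # intact, other cached copies exist
    DELTA = "delta"   # modified, the only cached copy


# The correspondence noted in the text: b with beta, c with gamma, d with delta.
CORRESPONDS = {XState.B: YState.BETA, XState.C: YState.GAMMA, XState.D: YState.DELTA}


def x_state(num_copies, obsolete):
    """Classify block X from NP(X) and whether main memory is obsolete."""
    if obsolete:
        if num_copies != 1:
            raise ValueError("an obsolete block must have exactly one cached copy")
        return XState.D
    if num_copies == 0:
        return XState.A
    return XState.B if num_copies == 1 else XState.C
```

For instance, `x_state(3, False)` classifies a block with three intact
cached copies, and `CORRESPONDS` then gives the state each of those
copies should be in.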
Figure 4 specifies the state-transitions required to maintain data
coherence when a write operation is performed by processor P_i. In case
of a cache hit on the write reference to C_i, there are four possible
states for y_i-state. If y_i-state is α, the situation is the same as a
cache miss except that the cache replacement algorithm does not have to
be executed. If y_i-state is β, CC_i signals MC to declare a write into
block X so that X-state has to be changed from b to d; in the mean
time, P_i's write operation and the change of y_i-state from β to δ are
carried out as well. This means that a processor's write operation is
not delayed if a cache hit is in state β. However, if MC is invoked by
another cache controller to inquire the state of block X at this time,
there is a slight possibility that the X-state is still in state b but
the corresponding y_i-state has been changed to δ. We call this problem
"the uncertainty of state b", that is, when MC detects a block X in
state b it can not be sure whether the block X is indeed in state b or
actually in state d. This problem, fortunately, does not complicate the
cache control mechanism as much as it first appears to and will be
discussed and resolved later. If y_i-state is γ, CC_i signals MC to
declare an exclusive writing in block X. Processor P_i can not write
into the corresponding cache block y_i until an X-state-transition
completion signal from MC is received. In order to change the X-state
from c to d, MC has to inform every other cache controller which owns a
copy of block X in its cache to invalidate that copy. In other words,
one request for the X-state's transition from c to d will invoke one or
more y-state transitions from γ to α.
In case of a cache miss on the write reference to C_i, a replacement
algorithm will be executed and a cache block y_i will be selected for
the missing block X. However, if the copy of block X' which originally
resides in the selected cache block y_i has been modified, it needs to
be swapped back to update the main memory. MC will also check to see if
there is only a single copy of block X' left in all other caches after
C_i has replaced this copy of block X'. If this is the case, the
corresponding cache controller, say CC_j, will be signaled to change
the y_j-state from γ to β to declare that the cache block y_j owns the
only copy of block X'.


Before a copy of block X is loaded into C_i, MC has to declare an
exclusive reading of block X for C_i. Thus, all cache controllers which
have a copy of block X in their caches will be signaled to invalidate
those copies. In addition, if MC detects that a copy of block X has
been modified, the main memory needs to be updated first. The
uncertainty of state b of block X does not really complicate the
declaration of exclusive reading. No matter what the X-state really is,
the cache controller which has the only copy of block X in its cache
has to be signaled to invalidate this copy. Therefore, whether or not
this copy has been modified can be checked at the same time.
Figure 5 specifies the state-transitions required to maintain data
coherence when a read operation is performed by processor P_i. Nothing
has to be done for reading block X if there is a valid cache hit in
C_i. If a cache miss occurs on the read operation to C_i, the same
replacement algorithm as the one specified in Figure 4 will be
executed. Before a copy of block X is loaded into C_i, however, MC
declares a shared reading of block X for C_i. Now, extra work has to be
done by MC for resolving the uncertainty of state b of block X; it is
not required otherwise. MC needs to signal CC_j to check the y_j-state
if a copy, and the only copy, of block X is in C_j. This uncertainty
checking can be done in parallel with loading a copy of block X into
C_i. However, if unfortunately the y_j-state is indeed δ, which is
rare, MC will be signaled by CC_j to update the block X and reload an
updated copy of block X into C_i. Note that we have to pay attention to
the timing of this uncertainty checking since it has to be completed
before processor P_i reads block y_i.
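
The write-hit and read-miss behaviour described above, including the
uncertainty of state b, can be traced with a toy Python model of a
single block X. This is a sketch under stated assumptions, not the
authors' implementation: the class `LSCSBlock` and all of its attribute
names are hypothetical, the state names alpha/beta/gamma/delta are an
assumed reading of the text's symbols, replacement is not modelled, and
a write hit on a copy that is already modified is assumed to be purely
local.

```python
class LSCSBlock:
    """Toy model of one block X under the LSCS scheme (illustrative only)."""

    def __init__(self, n_caches):
        self.x_state = "a"                     # MC's view: a, b, c or d
        self.y_state = ["alpha"] * n_caches    # each CC_i's view of its copy
        self.pending_write = None              # declare-write not yet seen by MC
        self.waited_for_mc = []                # processors whose write was delayed
        self.surprises = 0                     # times state b was really state d

    def write_hit(self, i):
        state = self.y_state[i]
        if state == "beta":
            # Local response: P_i writes at once; MC learns of it later,
            # so x_state remains "b" for the moment.
            self.y_state[i] = "delta"
            self.pending_write = i
        elif state == "gamma":
            # Exclusive writing: MC invalidates all other copies (c -> d)
            # before P_i is allowed to write.
            for j, other in enumerate(self.y_state):
                if j != i and other == "gamma":
                    self.y_state[j] = "alpha"
            self.x_state = "d"
            self.y_state[i] = "delta"
            self.waited_for_mc.append(i)
        elif state != "delta":                 # delta: assumed local write
            raise ValueError("an invalidated copy is handled like a miss")

    def mc_handles_declare_write(self):
        if self.pending_write is not None:     # the delayed b -> d transition
            self.x_state, self.pending_write = "d", None

    def read_miss(self, i):
        """Shared reading for C_i; True if main memory had to be updated."""
        updated = False
        if self.x_state == "a":
            self.x_state, self.y_state[i] = "b", "beta"
            return updated
        if self.x_state in ("b", "d"):
            owner = next(j for j, s in enumerate(self.y_state)
                         if s in ("beta", "delta"))
            if self.y_state[owner] == "delta":  # owner swaps its copy back
                updated = True
                if self.x_state == "b":         # uncertainty of state b
                    self.surprises += 1
                self.pending_write = None
            self.y_state[owner] = "gamma"
        self.x_state, self.y_state[i] = "c", "gamma"
        return updated


block = LSCSBlock(3)
block.read_miss(0)               # a -> b, copy in C_0 becomes beta
block.write_hit(0)               # beta -> delta with no delay
mc_view = block.x_state          # MC has not yet processed the declare-write
updated = block.read_miss(1)     # uncertainty check finds the modified copy
block.write_hit(1)               # gamma: must wait for MC, others invalidated
```

In this trace the first write proceeds without consulting MC, while the
second one, issued on a shared copy, is the only write recorded in
`waited_for_mc`.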

5. Considerations of implementation 

Sample implementations of the specifications in Figures 4 and 5 are
illustrated in Figures 6 and 7 respectively. A 3-bit tag (2 bits if
encoded) is associated with each cache block to represent the
y_i-state. Each bit of the tag is interpreted as follows if it is set.

v_i[y]: valid bit. The copy of block X in cache block y_i is valid.

s_i[y]: single bit. The copy of block X in cache block y_i is the only
        copy of block X in caches.

m_i[y]: modify bit. The copy of block X in cache block y_i has been
        modified by P_i.

An (N+1)-bit tag is associated with each block in the main memory to
represent the X-state. Each bit of the tag is interpreted as follows if
it is set.

P[X,i]: ith bit in the present array. A copy of block X is in C_i,
        where i=1,...,N.

M[X]: modify bit. Block X is obsolete.

In addition, a combinatorial logic circuit NP(X), which tells us how
many bits are set in the present array of block X, is required in MC.
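
The tag encodings can be decoded as in the following Python sketch. It
is an illustration under assumptions: the function names are
hypothetical, the present array is represented as an integer bit mask,
the state names alpha/beta/gamma/delta are an assumed reading of the
text's symbols, and the mapping from bits to states is inferred from the
state definitions given earlier rather than stated in this form by the
paper.

```python
def y_state_from_tag(v, s, m):
    """Decode a cache block's 3-bit tag (valid, single, modify)."""
    if not v:
        return "alpha"                 # invalidated copy
    if s and m:
        return "delta"                 # modified and the only copy
    if s:
        return "beta"                  # intact and the only copy
    if not m:
        return "gamma"                 # intact, other copies exist
    raise ValueError("a modified copy must be the single copy")


def np_count(present):
    """NP(X): number of bits set in the present array of block X."""
    return bin(present).count("1")


def x_state_from_tag(present, modified):
    """Decode a main memory block's (N+1)-bit tag into its X-state."""
    copies = np_count(present)
    if modified:
        if copies != 1:
            raise ValueError("an obsolete block must have exactly one copy")
        return "d"
    if copies == 0:
        return "a"
    return "b" if copies == 1 else "c"
```

Since only four of the eight bit patterns of the cache tag denote legal
states, two bits suffice once the tag is encoded, which matches the
"2 bits if encoded" remark.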
There is little problem with the cache tag organization. However, there
is a variety of different ways to organize the main memory tag for each
block, which directly affects the organization of MC. The two most
intuitive approaches are as follows. One is to include the (N+1)-bit
tag in each main memory block so that the tags are actually a portion
of the main memory space and are spread out over the entire space. The
other is to aggregate all tags into a dynamically managed bit map which
can be implemented by a faster device. However, the former approach
suffers a slow memory reference time; the latter approach has the
problem of contention.


We suggest physically distributing the MC (and, of course, all tags)
as illustrated in Figure 8. Furthermore, the main memory is interleaved
by both higher and lower order bits. Two levels of distribution are
intended. The lower level is achieved by associating a module of MC
with each main memory module. Thus, the contention in MC is highly
reduced. On the other hand, the higher level is achieved by
interleaving the memory in higher order bits so that the availability
of the main memory is provided. Note that the interleaving on the lower
bits is also necessary for reducing potential memory interference on
shared code.

With regard to the position of the lower bits, it depends on whether
the main memory is interleaved by block, subblock, or word. This
further depends on the ratio of interconnection network switch time
(circuit switch) to main memory module access time. In other words, if
the circuit switch set-up time is relatively long, interleaving by
block would be a better choice. On the contrary, if the main memory
access time is relatively long, interleaving the main memory by
subblock or word would then be better. Such an architectural decision
can be made by manipulating the parameters of main memory access time
te and block transfer parameter y in our earlier models reported in
[18].

Thus, each MC module actually contains a set of tags (a portion of the
dynamically managed bit map) for the corresponding main memory module
and a replica of the MC logic circuit and microprogram. The MC module
may be implemented by a faster device than the main memory module, and
the operations in MC may also be overlapped with those in the main
memory module. As a result, the performance degradation due to the
addition of the coherence mechanism can be highly reduced. We can also
take advantage of such an organization to include buffers in MC and
CC_i. Because these controllers are now physically associated with the
individual memory (main memory and cache) modules, the additional
buffers do not affect the coherence mechanism much. The system
performance can thus be further improved.


6. Performance Estimates 

The use of a coherence mechanism always degrades the system
performance. The effect of a scheme based on the map-based approach
appears in both the cache hit ratio and the effective memory access
time. A lower cache hit ratio will be observed due to inevitable cache
invalidations, while the need for interactions with memory controllers
slows down the memory access. The cache invalidation rate will be the
same for all map-based schemes; thus, the effective memory access time
is used as the performance index. A rough comparison between the PSCS
and LSCS schemes is given in this section.

Let us assume that the probability of a valid hit in cache is h. The
probability that a cache block contains the only copy of a main memory
block is (1-ρ), where ρ is the multi-copy coefficient. Furthermore, the
mean time required for completing a memory access in case of a valid
cache hit and without consulting with MC is t. The mean time required
for completing a memory access in case of a valid cache hit but
requiring a consultation with MC is t'. The mean time required for
completing a memory access in case of a miss or an invalid cache hit is
T. The effective memory access time may be obtained by assuming that
the proportion of data brought into the cache by a read miss and
subsequently modified by a write is θ. Then, the effective memory
access time for the PSCS scheme is

$$W_P = h[1-\theta(1-h)]t + h\theta(1-h)t' + (1-h)T,$$

and for the LSCS scheme is

$$W_L = h[1-\theta\rho(1-h)]t + h\theta\rho(1-h)t' + (1-h)T.$$

Thus, the difference is

$$W_P - W_L = h\theta(1-\rho)(1-h)(t'-t).$$

The amount of performance improvement is directly related to the
specific organization and implementation of MC. Nevertheless, this
gives a typical performance improvement of about 5 to 15 percent. In
particular, the additional hardware overhead of the LSCS scheme is
almost negligible.
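
The comparison can be checked numerically with a short Python sketch.
It assumes the two access-time expressions read as a weighted sum of t,
t' and T in which the PSCS scheme consults MC for a fraction θ(1-h) of
valid hits and the LSCS scheme only for a fraction θρ(1-h); the function
names and every parameter value below are hypothetical and chosen purely
for illustration, not measurements from the paper.

```python
def w_pscs(h, theta, t, t_mc, T):
    """Effective access time under PSCS (t_mc stands for t')."""
    consult = theta * (1 - h)            # fraction of hits that consult MC
    return h * (1 - consult) * t + h * consult * t_mc + (1 - h) * T


def w_lscs(h, theta, rho, t, t_mc, T):
    """Effective access time under LSCS (t_mc stands for t')."""
    consult = theta * rho * (1 - h)      # only multi-copy blocks consult MC
    return h * (1 - consult) * t + h * consult * t_mc + (1 - h) * T


# Hypothetical parameter values.
h, theta, rho = 0.9, 0.3, 0.2
t, t_mc, T = 1.0, 5.0, 10.0

difference = w_pscs(h, theta, t, t_mc, T) - w_lscs(h, theta, rho, t, t_mc, T)
closed_form = h * theta * (1 - rho) * (1 - h) * (t_mc - t)
```

Under these assumptions `difference` agrees with `closed_form` up to
floating-point rounding, and the gain grows as the multi-copy
coefficient falls or as the cost of consulting MC rises.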


Concluding Remarks 
The coherence problem in a multicache system has been treated in a
systematic top-down manner. The proposed LSCS scheme offers a better
performance with negligible additional hardware overhead.


Figure 2. The shared cache system (the diagram marks the context
switching boundary)


Figure 3. The states of block X (diagram: the y_i-state, the legal
state of a copy of block X in C_i viewed from CC_i, is classified by
v_i[y]=0 and by m_i[y]=1 or m_i[y]=0; the X-state, the legal state of
block X viewed from MC, is classified by NP(X)=0, NP(X)=1 or NP(X)>1
and by whether block X is updated or obsolete)


REPLACEMENT

SHARED READING

   case X-state of
      a: a → b;
         /declare y_i contains the only copy of X/
      b: b → c;
         /check the uncertainty of b state;/
         /checking can be done in parallel with/
         /loading X/
      c: c → c;
      d: d → c;
         /update X/
   load X → y_i;


Figure 5. State-transitions for a read operation


REPLACEMENT

   select a cache block y to be replaced;
   if the copy of block X' in y modified then
      update X';
   if NP(X')=1 after replacement then
      declare single copy of block X';
      /this single copy declaration can be/
      /carried out in parallel with C_i's/
      /exclusive reading/

DECLARE WRITE

   b → d;
   /X-state change/
   /is carried out/
   /in parallel with/
   /write y_i/

DECLARE EXCLUSIVE WRITING

   c → d;
   /invalidate all/
   /other copies/
   /of block X/

EXCLUSIVE READING

   case X-state of
      a: a → d;
      b: b → d;
      c: c → d;
         /invalidate all copies of block X/
      d: d → d;
         /update X/
   load X → y_i;


Figure 4. State-transitions for a write operation


get a cache block y;
if v_i[y]=1 then
cobegin
   if m_i[y]=1 then
   cobegin
      swap (y) → X';
      M[X']:=0;
      P[X',i]:=0;
      if NP(X')=2 then
         find j, signal CC_j;
         /CC_j set s_j[y]:=1/
   coend;
coend

M[X]:=1;

for all j that j≠i and P[X,j]=1 do
   fork:
   cobegin
      P[X,j]:=0;
      v_j[y]:=0;
   coend;
join: NP(X)-1;

for all j that P[X,j]=1 do
   fork:
   cobegin
      P[X,j]:=0;
      v_j[y]:=0;
      if m_j[y]=1 then
         swap (y) → X;
   coend;
join: NP(X);
cobegin
   P[X,i]:=1;
   M[X]:=1;
   load (X) → y_i;
coend;


Figure 6. A sample implementation of a write operation


get a cache block y;
if v_i[y]=1 then
cobegin
   if m_i[y]=1 then
      swap (y) → X';
   M[X']:=0;
   P[X',i]:=0;
   if NP(X')=2 then
      find j, signal CC_j;
      /CC_j set s_j[y]:=1/
coend;

(The remaining boxes of this figure, one per X-state, are garbled in
this copy. The legible fragments are: find j that P[X,j]=1;
swap (y) → X; s_j[y]:=0; m_j[y]:=0; if m_j[y]=1 then; load (X) → y_i;
P[X,i]:=1; M[X]:=0.)


Figure 7. A sample implementation of a read operation


Figure 8. The organization of MC (diagram labels: MAIN MEMORY; MEMORY
ADDRESS; HIGHER ORDER BITS; LOWER ORDER BITS)


Abstract -- Designs for distributed systems with dynamic structure can
be very difficult to understand and reason about. Constrained
expressions, a closed form, non-procedural representation of all the
possible behaviors of such a system, can help designers in analyzing a
dynamically-structured distributed system's design. In this paper, the
constrained expression formalism is introduced, an effective procedure
for deriving constrained expressions from procedural design
descriptions is outlined, and an example illustrating the use of this
procedure in analyzing a design is presented.



Introduction 

The Dynamic Process Modelling Scheme (DPMS) and its main descriptive
component, the Dynamic Modelling Language (DYMOL) [1], were developed
to provide a foundation for software design tools applicable to
distributed systems with dynamic structure. A distributed system with
dynamic structure is one in which some or all of the system's
components may come into and/or go out of existence, and intercomponent
communication paths may be established and/or closed, during the life
of the system [2]. Systems of this kind are increasingly common; they
include such things as process-based distributed operating systems,
computer networks, and the tasking facility in Ada. The potential value
of DYMOL as a language to aid designers of dynamically-structured
distributed software systems has been demonstrated in [1]-[3].

Dynamically-structured distributed systems can be very difficult to
understand and reason about, primarily due to the subtle interactions
among system components that can arise due to dynamic structure and
concurrent activity in such systems. Hence, it is useful to have
techniques supporting the succinct description and rigorous analysis of
the possible behaviors of a system of this kind. To this end, DPMS
provides constrained expressions, a closed form, non-procedural
representation for all the possible behaviors that could be realized by
some dynamically-structured distributed system.

Constrained expressions are related to other regular expression-based
description languages [4] such as event expressions [5], path
expressions [6], flow expressions [7] and counter expressions [8].
Constrained expressions are more general than any of these related
languages, however. Constrained expressions also permit the description
of the behavior of dynamically-structured distributed systems, which
these other languages do not.

*Supported in part by the National Aeronautics and


In this paper, we introduce the constrained 
expressions formalism, outline the effective 
procedure for deriving constrained expressions 


from DYMOL design descriptions and give an example 
illustrating the use of this derivation procedure 
in informally analyzing the DYMOL design of a 
dynamically-structured distributed system. We 
also outline our current and future work on formal 
analysis techniques for dynamically-structured 
distributed system designs based upon the 
constrained expressions formalism. 


Constrained Expressions 


Informal Description 


Constrained expressions are a closed form,
non-procedural representation of concurrent
behavior in the same sense that regular
expressions are a closed form, non-procedural
representation of the behavior of finite state
machines. In fact, the operators used in
constrained expressions include the standard
regular expression operators (concatenation,
alternation, transitive closure) as well as two
operators (interleaving and its transitive
closure) used to represent concurrent activity. A
constrained expression is formed by using these
operators to combine symbols from an alphabet of
events in the system being described into a
collection of subexpressions, one subexpression
for each component in the modelled system. The
interleave of these subexpressions then represents
the unconstrained set of possible system
behaviors, ignoring such fundamental properties as
the necessity of a message's being sent before it
can be received or an intercomponent communication
channel's being opened before it can be used in
message transmission. The required fundamental
properties are formally described by a second
collection of subexpressions, called the
constraint set. Then the set of behaviors (or, in
formal terms, the language over the event
alphabet) described by the overall constrained
expression is just what remains after the
unconstrained set of behaviors is filtered by the
constraint set. This filtering process can be
formally defined as a set intersection.
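
The informal description above can be illustrated with a small Python sketch. It is not taken from the paper: the two-component system, the event names `send` and `recv`, and the helper names are assumptions chosen for illustration. The sketch interleaves the behaviors of two components to obtain the unconstrained set, then filters that set by intersecting it with a constraint.

```python
def shuffle(u, v):
    """Return the set of all interleavings of event sequences u and v (tuples)."""
    if not u:
        return {tuple(v)}
    if not v:
        return {tuple(u)}
    return ({(u[0],) + w for w in shuffle(u[1:], v)} |
            {(v[0],) + w for w in shuffle(u, v[1:])})


def send_precedes_receive(seq):
    """Hypothetical constraint: no prefix contains more receipts than sends."""
    pending = 0
    for event in seq:
        if event == "send":
            pending += 1
        elif event == "recv":
            pending -= 1
            if pending < 0:
                return False
    return True


# Hypothetical system: one component sends a message, the other receives it.
sender = ("send",)
receiver = ("recv",)

# Interleave of the component subexpressions: the unconstrained behaviors.
unconstrained = shuffle(sender, receiver)

# The constraint, restricted to the strings under consideration.
constraint = {s for s in unconstrained if send_precedes_receive(s)}

# Filtering is a set intersection.
behaviors = unconstrained & constraint
```

Here the unconstrained set contains both orderings of the two events, and the intersection keeps only the ordering in which the send comes first.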


Formal Definitions 

Constrained expressions define languages over
an alphabet, $E$, of distinguished events. The
expressions are composed of symbols from $E$,
symbols from an auxiliary alphabet $S$ of constraint
symbols, the special symbols $\lambda$ (null event
sequence) and $\emptyset$ (empty set of event
sequences), and a set of operator symbols. The
constrained expression operators include the familiar
operators of regular expressions -- alternation
(represented by $\cup$), concatenation (represented by
juxtaposition), and transitive closure
(represented by $*$) -- plus two operators, $\Delta$
and $\dagger$, used to represent concurrent activity.
The $\Delta$ operator signifies the shuffling or
interleaving of the two strings that are its
operands. The unary operator $\dagger$ denotes the
interleaving of zero or more copies of its
operand. (a)

In defining the language represented by a
particular constrained expression, we begin with a
representation over an augmented alphabet and use
an interpretation rule to produce a set of strings
over the actual alphabet of interest. In
particular, we let $E$ and $S$ be two disjoint, finite
sets called the event alphabet and the constraint
alphabet, respectively. A constraint set, $CS$,
consisting of $n$ constraining languages, $C_i$ for
$1 \le i \le n$, can then be defined with respect to $n$
disjoint subsets, $S_i$, of $S$. Each such $C_i$ is
represented by an expression over $S_i$, $ex(S_i)$,
formed using any of the event expression
operators, and interleaved with $E^*$ and $S_j^*$ for
$j \ne i$. That is,

$$C_i = ex(S_i) \,\Delta\, S_1^* \,\Delta \cdots \Delta\, S_{i-1}^* \,\Delta\, S_{i+1}^* \,\Delta \cdots \Delta\, S_n^* \,\Delta\, E^*$$

for each constraining language $C_i$, $1 \le i \le n$. A
constrained expression with respect to $CS$ is then
defined to be any expression over $(E \cup S)$ that can
be formed using the event expression operators
other than $\dagger$. This expression thus represents a
regular language $L'$, which is a subset of $(E \cup S)^*$
and which we call the uninterpreted language of
the constrained expression. We also define a
homomorphism $H: (E \cup S)^* \to E^*$ by:

$$H(e) = e \ \text{for all } e \text{ in } E$$

$$H(s) = \lambda \ \text{for all } s \text{ in } S$$

Finally, for a given constrained expression with
respect to $CS$ which represents the uninterpreted
language $L'$, we define the interpreted language,
$L$, represented by the expression to be the set of
strings over $E$ (i.e., subset of $E^*$) described by:

$$L = H(L' \cap C_1 \cap \cdots \cap C_n) \ \text{for } C_i \text{ in } CS \qquad (1)$$

This definition generalizes the definition given
for counter expressions in [8] by allowing for the
application of multiple constraints in determining
the acceptability of a string.
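
As a rough illustration of interpretation rule (1), the following Python sketch applies it to a small finite sample of an uninterpreted language. Everything concrete here is an assumption made for the example: the alphabets `E` and `S`, the sample strings in `L_prime`, and the predicate `c1`, which stands in for a constraining language by testing membership rather than by representing the language as an expression.

```python
E = {"msg"}        # hypothetical event alphabet
S = {"@", "@'"}    # hypothetical constraint alphabet: send mark, receive mark


def H(string):
    """Homomorphism H: keep event symbols, map constraint symbols to the null sequence."""
    return tuple(sym for sym in string if sym in E)


def c1(string):
    """Stand-in for a constraining language: each @' needs an earlier unmatched @."""
    pending = 0
    for sym in string:
        if sym == "@":
            pending += 1
        elif sym == "@'":
            if pending == 0:
                return False
            pending -= 1
    return True


def interpret(uninterpreted, constraints):
    """L = H(L' intersected with every C_i), each C_i given as a membership test."""
    return {H(w) for w in uninterpreted if all(c(w) for c in constraints)}


# A finite sample of an uninterpreted language L' over E union S.
L_prime = {
    ("@", "msg", "@'", "msg"),   # send, then receive
    ("@'", "msg", "@", "msg"),   # receive before any send: filtered out
    ("@", "msg"),                # sent but never received: allowed
}

L = interpret(L_prime, [c1])
```

The second sample string is rejected by the constraint before the homomorphism is applied, so it contributes nothing to the interpreted language.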


(a) The shuffle operation was first defined and
studied by Ginsburg [11], while Riddle presents a
thorough formal discussion of both of these
concurrency operators in [5].

(b) Proofs that expressions of this form represent
regular languages may be found in [11], [5], and
[8].


Analyzing Designs


For an important subset of
dynamically-structured distributed systems, DYMOL
and constrained expression descriptions are
related by an effective procedure for deriving the
constrained expressions describing the potential 
behavior of any given DYMOL description of a 
system. This effective procedure is similar to 
the syntax-directed translation scheme commonly 
employed in compiler construction [9]. By using 
this procedure, the designer of a 
dynamically-structured distributed software system 
can derive a succinct representation of the 
possible behaviors of a system whose design is 
described in the more natural, procedural language 
DYMOL. This succinct, constrained expression 
representation provides the basis for an informal 
analysis of the DYMOL design, since it can expose 
intercomponent synchronization or communication 
anomalies. It also serves as the starting point 
for more formal analysis methods, based upon the 
derivation of systems of inequalities from a 
constrained expression description, which are 
currently being developed [10]. 

For the subset of DYMOL-described systems
that we are considering here, the constraint set
$CS$ consists of three constraining languages. The
first, $C_1$, describes the necessary restriction on
transmission of messages in a distributed system.
$C_1$ is expressed using subset $S_1$ of the constraint
alphabet $S$, where $S_1 = \{@_i, @_i'\}$. The special
symbol $@_i$ can be thought of as corresponding to the
sending of a particular message along a particular
message transmission channel, with the subscript $i$
indexing the specific message type and channel
pair. Similarly, the symbol $@_i'$ may be thought of
as corresponding to the receipt of a particular
message on a particular channel. The constraining
language $C_1$ is represented by the expression

$$C_1 = \mathop{\Delta}_{i} \left( @_i^* \,\Delta\, (@_i\, @_i')^{\dagger} \right) \,\Delta\, S_2^* \,\Delta\, S_3^* \,\Delta\, E^*$$

The event expression $(bc)^{\dagger}$ represents a set that
may be described as "all strings containing an
equal number of b's and c's such that any prefix
of any string always contains at least as many b's
as c's". Thus, $C_1$ describes the requirement that
the reception of a message is always preceded by
the sending of a corresponding message, although
more messages may be sent than are ever received
(due to the interleaved $@_i^*$). This precisely
captures the message transmission semantics of
DYMOL.
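
The quoted description of the interleaving closure of bc can be checked by brute force for small cases. The Python sketch below is only an illustration under stated assumptions: the letters b and c play the roles of a send symbol and its matching receive symbol, and all function names are hypothetical. It builds the interleavings of k copies of "bc" directly and compares them with the "equal counts plus prefix condition" description.

```python
from itertools import product


def shuffle(u, v):
    """All interleavings of strings u and v."""
    if not u:
        return {v}
    if not v:
        return {u}
    return ({u[0] + w for w in shuffle(u[1:], v)} |
            {v[0] + w for w in shuffle(u, v[1:])})


def closure_slice(copies):
    """Interleavings of exactly `copies` copies of 'bc'."""
    result = {""}
    for _ in range(copies):
        result = {w for r in result for w in shuffle(r, "bc")}
    return result


def prefix_ok(s):
    """True if no prefix of s contains more c's than b's."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "b" else -1
        if depth < 0:
            return False
    return True


def balanced(s):
    """The quoted description: equal counts plus the prefix condition."""
    return prefix_ok(s) and s.count("b") == s.count("c")


def words(length):
    """All strings over {b, c} of the given length."""
    return {"".join(p) for p in product("bc", repeat=length)}


# Compare the operator with its description for 0 to 3 copies.
agree = all(closure_slice(k) == {w for w in words(2 * k) if balanced(w)}
            for k in range(4))
```

Dropping the equal-count requirement, as `prefix_ok` does, corresponds to interleaving extra sends: `prefix_ok("bbc")` holds (one message sent but not received) while `prefix_ok("cb")` does not (a receipt with no prior send).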


Constraining languages $C_2$ and $C_3$ are formed
using subsets $S_2 = \{\$_i, \$_i', \&_i, \&_i'\}$ and
$S_3 = \{\#_i, \#_i'\}$ of
$S$, respectively. For brevity and simplicity, we
omit their detailed definitions. $C_2$ describes a
constraint on the use of interprocess
communication channels in a dynamically-structured
distributed system. Specifically, it stipulates
that message transmission may only take place
along channels that are currently operational. $C_3$
similarly governs the use of message contents in
determining the flow of control within a process
in a DYMOL model. (See [2] for details on these
constraining languages.)

In the setting of the constraint set
$CS = C_1 \cup C_2 \cup C_3$, analysis of a design
modelled in DPMS proceeds in two stages. First, the
effective procedure is applied to the model,
translating it from a DYMOL description into a
constrained expression. This procedure, defined in
detail in [2], is very similar to the translation
performed by a compiler, being based upon
translation rules associated with each syntactic
construct of DYMOL.
The result of this translation is an expression
consisting of message type names (the elements of
the event alphabet $E$ in this setting), symbols
from $S$ and the constrained expression operators.
The second stage of analysis involves inspection
of the language represented by the derived
constrained expression. Specifically, strings in
the language that correspond to either desirable
or undesirable system behaviors are sought. While
no completely general algorithmic approach to this
search is possible, it nevertheless can often
result in useful information to guide the designer
of a dynamically-structured distributed system.
Several examples, such as those in [1]-[3], have
demonstrated the value of this technique.


Examples 

Although space limitations preclude a fully
detailed treatment, we offer the following
excerpts from an example in [2] to illustrate
various facets of the preceding discussion.

This example concerns a DPMS model of a
distributed system with dynamic connectivity,
namely a producer-consumer situation in which a
producer generates information that can be
processed by either of two consumers. The
producer in the modelled system generates a stream
of variable-length information packets, each
packet postfixed with a termination indicator. It
is intended that each packet constitute a single
complete, coherent set of data for a consumer.
Therefore, proper processing requires that each
complete packet, including its termination
indicator, be received by exactly one consumer.
Each time that the producer is prepared to
generate an information packet, a manager process
called c_pool selects a consumer to receive the
information. When the producer has completed the
generation of a packet, c_pool is notified to
disconnect the producer from the most recently
active consumer.

Figure 1 gives the DYMOL description of the
producer process of this model. (See [1] for
details on DYMOL and [2] for a complete version of
this example model and its analysis.) Following
each sending of the 'ready' message indicating its
intention to generate an information packet, the
producer (at p4) awaits an indication that a
consumer is prepared to receive it before actually
commencing generation of the information. The
consumer's confirmation of each reception of a
'goods' message is used by the producer (at p8) to
ensure that the items in an information packet are
consumed in the same order in which they were
produced. The producer's 'term' message (p9 and
p10) is the packet termination indicator sent to
the consumer, while the 'done' message (p11 and
p12) informs the consumer pool manager, c_pool,
that a complete packet has been generated.

producer: p1:  WHILE INTERNAL TEST DO
               BEGIN
          p2:    SET BUFFER := ready;
          p3:    SEND cp;
          p4:    RECEIVE ok;
          p5:    WHILE INTERNAL TEST DO
                 BEGIN
          p6:      SET BUFFER := goods;
          p7:      SEND info;
          p8:      RECEIVE ok;
                 END;
          p9:    SET BUFFER := term;
          p10:   SEND info;
          p11:   SET BUFFER := done;
          p12:   SEND cp
               END.


Producer Process


Figure 1


The Producer Process Subexpressions


Figure 2


The result of applying the constrained
expression translation procedure to the DYMOL
description of the producer process is shown in
Figure 2. This is the subexpression describing
the unconstrained possible behavior of this
process. Figure 3 shows the complete constrained
expression for the full modelled system, including
the producer, the consumers, and c_pool. Now,
using the interpretation rule for constrained
expressions (1) it can be determined that one
string contained in the language described by the
Figure 3 constrained expression is:


ready ready ready ready goods goods got_it got_it
term done done ready ready ready ready goods
goods got_it got_it term done done term


The message sequence represented by this
string corresponds closely to the expected
sequence of message transmissions in the modelled
system in a case where the producer sends two
single-item information packets. A sequence of
four (c) 'ready' messages would naturally appear as
the producer process signalled c_pool of its
intention to generate a packet and c_pool
responded by indicating that a consumer was
prepared to receive the information. The
transmitted information ('goods') and the
confirmation of its reception ('got_it') follow as
expected, and finally the 'term' and 'done'
messages indicate the completion of a packet
transmission. The intermixing of the final four
symbols of the string (in fact, the derived
constrained expression permits any ordering of the
last three symbols in this behavioral
representation) indicates that the receiving of
the 'term' and 'done' messages by a consumer and
the c_pool process, respectively, are potentially
concurrent events.

Examination of the string reveals one
unexpected aspect of the modelled system's message
transmission behavior, however. The string
contains only three 'term' symbols, indicating
that one information packet termination indicator
was not received by a consumer process, although
it was sent by the producer. This is an
unacceptable situation under our previously-stated
assumption that a complete packet, including the
termination indicator, should be received by a
single consumer each time such a packet is
generated. Yet the fact that this string is a
string in the interpreted language of the derived
constrained expression indicates that this
unacceptable behavior can be realized by the
system as currently modelled. To the software
system designer contemplating this DPMS model as a
possible design for the producer-consumer system,
this would presumably indicate that the proposed
design was faulty. In fact, examination of the
Figure 3 constrained expression can lead to
discovery of the source of the difficulty. This
is done in [2], where a revised version of the
DYMOL design is then presented. Performing the
constrained expression analysis on the revised
design demonstrates that the difficulty has indeed
been corrected.
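
The inspection described above can be mimicked mechanically. The following Python sketch is an illustration only, not a tool from the paper: it counts the symbols in the example string and flags any message type that occurs an odd number of times. The parity test is an assumption that follows from the observation that each completed communication contributes two symbols, one for the send and one for the receipt.

```python
from collections import Counter

# The example string from the text, one token per message symbol.
observed = ("ready ready ready ready goods goods got_it got_it "
            "term done done ready ready ready ready goods "
            "goods got_it got_it term done done term").split()

counts = Counter(observed)

# A completed communication yields two symbols (send and receive), so an
# odd count suggests a message that was sent but never received.
unreceived = sorted(msg for msg, c in counts.items() if c % 2 == 1)
```

For this string `counts["term"]` is 3 and `unreceived` contains only `term`, matching the anomaly identified in the text.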


(c) Each completed communication of the modelled
system generates two symbols in the message
sequence describing its behavior. One represents
the movement of the message from sender into the
message channel while the other signifies the
message's movement from the message channel to
receiver. Under the semantics of DPMS, these
events can be arbitrarily separated in time, hence
the corresponding symbols need not, in general,
occur adjacent to one another in the string.


Conclusion 

In this paper we have introduced the
formalism of constrained expressions and suggested
how constrained expressions can be used in
analyzing designs of dynamically-structured
distributed systems. As our example indicates,
however, the derived constrained expressions can
be long and unwieldy. Therefore, analysis based
upon manually manipulating and inspecting the
derived expressions is likely to be incomplete and
error prone. At the very least, one would like to
have automated tools for generating example
strings from the language represented by a given
constrained expression. Such tools would simplify
the analysis and reduce the
likelihood of errors. Fortunately, such tools are
quite straightforward to build. Ideally, one
would like formal techniques, more powerful than
simple inspection, for uncovering anomalies in the
behavior described by a constrained expression.
We are presently working on developing such formal
techniques [3]. Our approach involves deriving a
system of inequalities from a constrained
expression description, then attempting to solve
the system of inequalities. This process formally
demonstrates the presence or absence of certain
types of behaviors in the modelled distributed
system. Examples of this technique appear in [3]
and [10].


Summary


A two-dimensional distributed multiprocessor
system structure based on the concept of splitted
bus was proposed in [1]. It was noted that the
introduction of switches into the bus structure makes
multiprocessor systems highly flexible and
cost-effective. The probability of bus contention can
be reduced, thus resulting in better line
utilization and shorter message delay time. The ease
of routing control by means of switches helps in
the organization of reconfigurable and partitionable
distributed systems.


The one-dimensional structure based on the
splitted-bus concept proposed in this paper looks
like a wheel, as shown in Fig.1. It consists of a
data loop and star-shaped control links between
the centralized routing controller and processor
nodes. The data loop is full duplex and can be
split into segments by the routing switches,
so that separate communication paths can be
established at once for several distinct pairs of
nodes to transfer variable-length messages so long
as these paths do not conflict with one another.
Three switches are needed for each processor node:
one central switch for segmentation of the data
loop, and two side switches for connecting the two
ports of the processor to the loop across the
central switch. Such an arrangement makes it possible
to realize different modes of interconnection:
broadcasting (one-to-all), shifting (every node
to its neighbor by any modulo count), and random
(multiple one-to-one). All these modes of
interconnection are useful for parallel computations
on a multiprocessor system. Furthermore, the
installation of three switches for each node
permits all or any number of processors to be
isolated and operate independently without affecting
the normal operation of the rest of the system.


For purposes of analysis, we establish two
models for the system: one for the control
processor, and the other for the data loop.


The job arrival process is modeled by two
waiting queues in the control processor. The queue
A accepts the message-communication requests from
all nodes and delivers them one by one to find the
desired interconnection paths on the basis of
first-come-first-served discipline. If the control
processor fails to find a path for some job, then
this job enters the second queue B, which is given
a higher priority than the queue A. The arrival of
jobs to the queue A assumes a Poisson distribution
with mean arrival rate $N\lambda$ packets per second,
where $N$ is the total number of processor nodes on
the loop, and $\lambda$ is the mean delivering rate of
packets from each node. All the nodes are assumed
to be identical. The message length per packet
assumes a negative exponential distribution with
mean length $1/\mu$ bytes per packet.


The model of the data communication loop
consists of $N$ separate models of the nodes, each
of which is composed of receiving buffers,
transmitting buffers, and the data link $i$ connecting
two adjacent nodes $i$ and $(i-1)$. Let $R_k$ be the mean
arrival rate of messages delivered from all nodes
and passing through data link $k$. It can be shown
that

$$R_k = R = \frac{\lambda N^2}{4\mu(N-1)} \ \text{bytes/sec}$$

and is independent of $k$ for all $0 \le k \le N-1$.


Let the maximum transfer rate of the
communication line be $V$ bytes/sec, then the bandwidth
utilization factor is equal to

$$\rho_u = \frac{R_k}{2V} = \frac{\lambda N^2}{8\mu V(N-1)}$$

The inverse ratio of the number of node
pairs which occupy the data link $k$ during message
transfers between them to the total number of node
pairs which require communications in general is

$$n = \frac{N(N-1)}{N^2/4} = \frac{4(N-1)}{N}$$

which represents the number of communication jobs
that can be served simultaneously by the loop in
the limit of its maximum bandwidth. It is a
measure of parallelism [2] of the system, and is
nothing else but the number of servers of the
system viewed from queueing theory.

To estimate the total message delay time $T$,
we neglect the time spent in repeatedly examining
the queue B and thus obtain

$$T = \frac{\rho_1 Z_1}{2(1-\rho_1)} + Z_1 + \frac{\rho_2 Z_2}{1-\rho_2} + Z_2$$

where $Z_1$ is the constant service time of queue A,
$Z_2$ is the mean service time of the data loop and
equals $1/\mu V$ sec, and
$\rho_2 = N\lambda Z_2/n$.
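
As a numerical illustration, the delay estimate can be evaluated with a few lines of Python. This is a sketch under explicit assumptions, not the authors' program: it takes the delay expression to be T = ρ1·Z1/(2(1−ρ1)) + Z1 + ρ2·Z2/(1−ρ2) + Z2, it uses ρ2 = NλZ2/n and Z2 = 1/(μV) as given in the text, and it assumes ρ1 = NλZ1 for the utilization of queue A, which the text does not state. The function name and parameter values are hypothetical.

```python
def total_delay(N, lam, mu, V, Z1, n):
    """Estimate the total message delay T in seconds.

    N   : number of processor nodes
    lam : mean packet rate per node (packets/sec)
    mu  : reciprocal of the mean packet length (1/bytes)
    V   : maximum transfer rate of the line (bytes/sec)
    Z1  : constant service time of queue A (sec)
    n   : measure of parallelism (number of servers)
    """
    Z2 = 1.0 / (mu * V)          # mean service time of the data loop
    rho1 = N * lam * Z1          # ASSUMPTION: utilization of queue A
    rho2 = N * lam * Z2 / n      # loop utilization, as given in the text
    if rho1 >= 1.0 or rho2 >= 1.0:
        raise ValueError("offered load saturates the system")
    wait_a = rho1 * Z1 / (2.0 * (1.0 - rho1))   # waiting in queue A
    wait_loop = rho2 * Z2 / (1.0 - rho2)        # waiting for the loop
    return wait_a + Z1 + wait_loop + Z2


# Hypothetical operating point: 8 nodes, 100-byte packets, 1 Mbyte/sec line.
N = 8
n_two_port = 4 * (N - 1) / N
light = total_delay(N, lam=1.0, mu=0.01, V=1e6, Z1=1e-3, n=n_two_port)
heavy = total_delay(N, lam=50.0, mu=0.01, V=1e6, Z1=1e-3, n=n_two_port)
```

With these inputs the delay grows with the per-node packet rate, and the function refuses to evaluate operating points at which either utilization reaches one.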


The formulae obtained above for calculating $T$
can also be applied to the cases of single-port
splitted-bus systems as well as integrated-bus
systems. The only difference exists in the value
of $n$. For integrated-bus systems, $n$ is always
equal to 1, whereas for single-port splitted-bus
systems, the value of $n$ can be found as

$$n = \frac{N(N-1)}{(N^2/4) + 2N - 2} = \frac{4N(N-1)}{N^2 + 8N - 8}$$

Different values of $n$ for the systems under
comparison are listed in the following table:


Number of nodes | 2-port splitted-bus loop system | 1-port splitted-bus loop system

The formula for bandwidth utilization is
rewritten in terms of $n$ for all cases as below:

$$\rho_u = \frac{N\lambda}{2\mu n V}$$

A series of simulation experiments has been
performed for 2-port splitted-bus, 1-port
splitted-bus, as well as integrated-bus systems. The
same assumptions and conventions were made as in
the theoretical analysis. The calculated and the
experimental curves are shown in Fig.2. They
coincide rather satisfactorily.


Abstract -- This paper proposes a framework
for implementing a logic programming environment
on a distributed system. ZMOB is one such
multiple microprocessor architecture where the
individual microprocessors are connected on a high speed
conveyor belt. The paper describes how the logic
programming environment is created on ZMOB. This
enables us to exploit the high level of
parallelism possible in logic programming. The approach
and preliminary design raise several relevant
issues in distributed processing environments
which are currently being investigated.


1. Introduction 


The introduction of logic as a conceptual
aid in the derivation of programs
([3]) has resulted in the implementation of the
programming language PROLOG ([2], [6], and [7]). PROLOG
interpreters have been designed to be executed on
a sequential machine, i.e., on single processor
architectures. However, the complete separation of
logic (the specification of the statements of the
problem to be executed) and control (the order of
execution of statements) allows several degrees
of freedom during the interpretation of the
program, thus providing a natural parallelism that
could be implemented on a parallel architecture
system ([4]).


The inherent nondeterminism available during 
a logic program execution can be exploited in the 
following directions: 

1. At any time during execution more than one 
goal node may be selected. 

2. Many literals in a clause can be expanded 
upon. 

3. Many procedures can be invoked for a pro- 
cedure call. 


Logic programming is distinguished from other
applicative programming languages such as LISP
by the fact that more than one procedure can
match a procedure call. This seeming disadvantage
on a sequential machine can be exploited in a
highly parallel environment, as some or all
matching procedures can be invoked simultaneously.
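
The idea that several procedures may match one call, and may then be invoked together, can be sketched in a few lines of Python. This is a toy illustration rather than anything from the ZMOB design: the sample program, the crude `matches` test (which stands in for unification), and the use of a thread pool in place of separate microprocessors are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical program: each procedure is (head, body).
# Terms are tuples; strings starting with an uppercase letter are variables.
PROCEDURES = [
    (("parent", "X", "Y"), [("father", "X", "Y")]),
    (("parent", "X", "Y"), [("mother", "X", "Y")]),
    (("sibling", "X", "Y"), [("parent", "Z", "X"), ("parent", "Z", "Y")]),
]


def is_var(term):
    """Variables are written with a leading uppercase letter."""
    return term[:1].isupper()


def matches(head, call):
    """Crude stand-in for unification: same name and arity, constants agree."""
    if head[0] != call[0] or len(head) != len(call):
        return False
    return all(is_var(h) or is_var(c) or h == c
               for h, c in zip(head[1:], call[1:]))


def expand(procedure):
    """Return the procedure body; a real system would also apply the unifier."""
    _head, body = procedure
    return body


def expand_all(call):
    """Invoke every matching procedure at once and collect the bodies."""
    candidates = [p for p in PROCEDURES if matches(p[0], call)]
    with ThreadPoolExecutor(max_workers=max(1, len(candidates))) as pool:
        return list(pool.map(expand, candidates))


alternatives = expand_all(("parent", "tom", "W"))
```

The call to `parent` matches two procedures, so `alternatives` holds two bodies, each of which could be pursued independently.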


We shall discuss the design considerations
for implementing a PROLOG-like language on a
highly parallel system called ZMOB, at a high
level. In our initial design the problems
associated with the parallel execution environment are
made transparent to a class of users who do not
want to know about it. For a detailed functional
specification and design considerations of ZMOB as
a parallel problem solving system see [1] and [2].
Because of length restrictions we assume that the
reader is familiar with Horn clauses and logic
programming. See [4] for details.


2. ZMOB Configuration and Architecture 


ZMOB is a multi-microprocessor, distributed
memory architecture with a high speed "conveyor
belt" communication facility among the
microprocessors, and a host mini-computer. There can be a
maximum of 256 microprocessors in the system, each
with a memory capacity of 64K bytes. There are 257
bins (including one for the host machine) which go
around on the conveyor belt carrying information
from the processors to which they are attached.
Ideally speaking, the conveyor belt is fast enough
to service individual microprocessors at the speed
at which they can access their core memory. For
more details refer to [5].


Communication among the processors can take
place either on a point to point basis or by
broadcast, wherein a single microprocessor can
communicate with a set of microprocessors on the
belt. This broadcast mode of message passing is
achieved by a pattern matching capability at each
processor instead of a destination oriented
transmission. An exclusive source mode is
available for communicating on a point to point basis
without repeated handshaking. Each processor has
a mail stop, which handles the interrupt,
receiving and transmission of information to and from
the bins.


In this section we first describe the decom- 
position of ZMOB into its functionally independent 
components. Then we present the rationale for this 
decomposition and describe the conceptual specif- 
ication of each component. 


There are five clusters of microprocessors
and the host machine in the ZMOB problem solving
system. They are:

1. The VAX-host machine.

2. A set of machines dedicated to problem
solving, termed Problem Solving Machines (PSMs).

3. A set of machines dedicated to servicing
function-free ground assertions, the Extensional
Database (EDB) machines.

4. A set of machines dedicated to servicing
general rules or axioms, termed Intensional
Database (IDB) machines.

5. An IDB monitor supervising the IDB
machines.


The assertions and the axioms of the logic
program will be distributed among the IDB and EDB
microprocessors and the actual problem solving
carried out on several PSM microprocessors
simultaneously. Figure 1 shows a small logic program
and its distribution among different clusters of
machines.


3.1 Problem Solving Machines (PSMs) 


The role of the PSMs is central to the entire
problem solving process. The capabilities of the
PSM permit the inherent parallelism available in a
logic program to be exploited in its entirety.


The PSM manages the search space. Initially
the goal node is placed in a PSM. In the process
of attempting to solve the goal node, a PSM has
the capability to dynamically create new PSMs
which can independently develop and manage the
subtrees of the search space. New PSMs are
initiated if there is non-determinism associated with
solving an atomic goal. Each PSM is autonomous
except for the knowledge of the parent-child
relationship and can, in turn, perform the same
operations. In addition to managing the goal tree and
dynamically initiating the new PSMs, the PSM
selects a subgoal of a conjunction of goals to be
solved, and selects a clause in the goal tree to
be expanded next. When a goal assigned to a PSM
is completely solved it transmits the solution to
its parent PSM. The parent of the initial PSM is
the VAX machine.


Each PSM interacts with IDB machines to
obtain procedure bodies to generate new nodes in
the goal tree. It also interacts with EDB machines
to solve atomic goals and to generate new nodes in
the goal tree. A PSM can work on more than one
path of the search tree while it is waiting for a
unifier and procedure body to be returned on some
other path from either the EDB or the IDB.


3.2 Intensional Database (IDB) Machines 


We assume that the IDB is relatively small 
and that each Z80A can effectively contain the set 
of all procedure clauses of the program. This 
assumption allows us to replicate the IDB on 
several machines referred to as IDB machines. Each 
IDB machine contains the same information, thus 
making the existence of several IDB machines
transparent to the PSMs. This replication of IDB
machines is advantageous on account of the
"broadcast by pattern" facility available on ZMOB.


Whenever the PSM sends an atom for expansion
to the IDB, the first "idle" IDB machine could
pick up the request and process it. In an
extension to this system we plan to consider the
handling of the IDB when the set of procedure clauses
exceeds the capacity of any one machine in the
system. The IDB could then be distributed on a set of
machines in much the same way as is done for the
EDB, still keeping it transparent to PSMs. An IDB
machine finds all matching procedures requested by
the PSMs, returning all matching procedure bodies
at once, or one at a time.


3.3 Extensional Database (EDB) Machines 
The EDB is the union of all relational tables 
containing ground, function-free assertions. 


We identify a relation with a unique integer
number obtained from the relation name. This
permits a relation to be readily relocatable onto any
EDB machine, thus creating a virtual addressing
facility. The atoms that belong to the EDB in the
procedure definition are specially marked to
denote they may be found in the EDB, the IDB or in
both. Thus PSMs send only the valid requests to
the EDB and IDB machines. When the request for
matching some atom P(...) is sent to the EDB, one
of the machines that contains the relation "P"
picks up the request. When a relation is
distributed among several EDB machines, the operation is
still similar, as one of the EDB machines
containing the relation acts as a supervisor, thus making
it transparent to the PSMs.


3.4 The Host Machine and the IDB Monitor 


The VAX 780 serves as the host machine and
acts as an interface between the user and the
ZMOB. The user loads ZMOB executable code via the
VAX and initiates processing. As noted earlier,
it shields the user from the parallel environment
of ZMOB.


The IDB monitor is a simple version of a monitor
and exists in the system to avoid certain
deadlocking and overload conditions. When all
IDB machines have their buffers full, no more 
requests may be sent to the IDB machines. The IDB 
monitor keeps track of the availability of buffers 
in IDB machines and when an overload condition is 
detected, it informs all PSMs to take corrective 
action. PSMs, in turn, request responses from the 
IDB machines without making new additional 
requests until buffers become available in IDB 
machines. This situation is detected by the IDB 
monitor and broadcast to all PSMs. 
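
The bookkeeping performed by the IDB monitor can be sketched as follows. This Python fragment is a hypothetical model, not ZMOB code: the class name, the method names, the machine labels and the broadcast strings are all assumptions. It tracks free buffers per IDB machine, records an overload broadcast when every machine is full, and records a second broadcast when a buffer becomes available again.

```python
class IDBMonitor:
    """Toy model of the IDB monitor's buffer bookkeeping (names are hypothetical)."""

    def __init__(self, free_buffers):
        self.free = dict(free_buffers)   # machine name -> free buffer count
        self.overloaded = False
        self.broadcasts = []             # messages that would go to all PSMs

    def request_accepted(self, machine):
        """A PSM request now occupies one buffer on `machine`."""
        if self.free[machine] == 0:
            raise RuntimeError(f"{machine} has no free buffer")
        self.free[machine] -= 1
        self._check()

    def response_collected(self, machine):
        """A PSM has taken a response, freeing one buffer on `machine`."""
        self.free[machine] += 1
        self._check()

    def _check(self):
        """Broadcast only when the overload state changes."""
        now_full = all(count == 0 for count in self.free.values())
        if now_full != self.overloaded:
            self.overloaded = now_full
            self.broadcasts.append(
                "overload" if now_full else "buffers available")


monitor = IDBMonitor({"idb0": 1, "idb1": 1})
monitor.request_accepted("idb0")
monitor.request_accepted("idb1")      # every buffer is now in use
monitor.response_collected("idb0")    # a PSM drains one response
```

After these three calls the monitor has recorded one overload broadcast followed by one availability broadcast, which mirrors the sequence described in the text.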


There are many alternative ways in which one
could design a parallel logic problem solving
system on ZMOB. The design of the system that we have
outlined above was arrived at through several
iterations and has numerous advantages.


The system is modular in design in that each 
processor is dedicated to a specific task: problem 
solving, procedure management, assertion manage- 
ment, or monitoring. Hence each microprocessor can 
operate independently of other microprocessors. 


The autonomous nature of each PSM in developing
its search tree enables it to use necessary
heuristics in the generation of the search tree.


The conceptual treatment of procedures and
assertions is handled uniformly. There is no
essential difference between the EDB and the IDB,
as they both serve as distributed memory to the
system. The distinction was made because of
different unification algorithms employed and
potential size differences. In database applications
the size of the IDB is likely to be significantly
smaller than the EDB portion. IDB procedures may
contain functions, whereas EDBs contain only
function-free ground atomic formulae.


The fact that a unique operation is performed
by each processor leads to speed up
possibilities such as preprocessing, and preparation
for next request at idle time.


The IDB follows the principle that information
is not returned unless requested. This
allows us to use the IDB as a temporary buffer for
the PSM and to perform local heuristic ordering on
the procedures returned by the IDB machine.


The system is inherently flexible and allows
the user to choose a particular configuration
(i.e., the number of EDBs, IDBs and PSMs) especially
suited for his purpose.


The relocatable nature of the usage of EDBs 
and IDBs permits them to be loaded onto any pro- 
cessor in the system. 


The functional independence of the system 
components increases the reliability of perfor- 
mance. 


4. Communication Among ZMOB Components


Any distributed processing environment 
requires that information be exchanged between its 
elements, and hence communication is needed 
between the elements. In an environment like ZMOB 
where the same problem is being distributed among 
several processors there is a need for information 
exchange to carry out the problem solving 
activity. Hence the need for communication primi- 
tives and message formats to keep communication 
overhead low and to maintain a flexible system. 


Two features of the ZMOB architecture, namely
the broadcast facility using a pattern, and the
exclusive source mode of communication, have been
useful in designing the primitives necessary for
problem solving. The broadcast facility allows a
single PSM to converse with a set of IDB and EDB
machines and the exclusive source mode permits
large amounts of data to be transferred between
two machines without handshaking overhead.


The ZMOB components constantly interact among 
themselves and the VAX for solving a problem. Com- 
munication primitives have been defined to carry 
on the interaction in an efficient and flexible 
manner. 


5. User Interface and Control 

Facilities for the user to interact with the
problem solver (machine) are an essential
part of the overall design of any
system.


VAX acts as the external interface to ZMOB. 
The user accesses ZMOB as a resource from the 
host machine and VAX provides a smooth interface 
relieving the user of the details of ZMOB aspects 
of problem solving. 


The user can be in three different states of
interaction as a logic problem solver. They are:
the off-line mode, the active mode, and the
execution mode.


In the off-line mode the user is, in general, 
creating a logic program (or deductive relational 
database as the case may be) by using the facili- 
ties available on the host machine. These opera- 
tions are performed outside the ZMOB utilization, 
as they do not need the ZMOB capabilities and dur- 
ing these operations ZMOB is potentially free to 
be used by others.


The active mode of operation contrasts with 
the off-line mode in the usage of specific support 
software developed for ZMOB and is executed on the 
host machine. This phase is employed in preparing 
for the execution mode by compiling and converting 
the source code to a ZMOB-compatible version. The
user has control over the configuration and other 
aspects of ZMOB in this mode. 


The execution mode is the phase in which the
user is actually using ZMOB for problem solving in
its full capability (though transparent to the
user) with VAX acting as the interface. In this
mode the VAX channels the user query (or the
goal) and other user interaction to ZMOB
components after checking them for syntax and
sequencing, and formatting them if necessary.
Also, all communication from the ZMOB components
to the user is channeled through VAX.


The user is to be provided with facilities to 
trace/debug his logic programs as well as to
gather specific statistical information from the 
system. The statistical information gathered can 
be used to fine-tune the system to specific needs 
and evaluate the performance of the system under 
various conditions. 


Apart from the interaction to create, compile,
execute and debug logic programs, the user
is to be provided with the facility to specify
and guide the control aspects of logic problem 
solving. The user can exercise several degrees of 
freedom in the choice of the control of a logic 
program. Within a clause literals can be selected 
ranging from left to right literal selection 
(i.e., depth first tree generation) as in PROLOG, 
to a completely re al literal selection. The 
user can specify these through the syntax of the 
axioms of the logic program. 


6. Summary and Scope 


The ZMOB configuration permits a high degree 
of parallelism to take place in solving a problem. 
The use of logic programming as a formalism per- 
mits the exploitation of the parallel capability 
without forcing the user to rewrite his program to 
account for the inherent parallelism that can take 
place. The separation of the logic of the specifi- 
cation of the program from the control permits 
this to be achieved. Thus the system being
designed will permit a problem to be executed on a
single machine or on multiple machines within the
ZMOB configuration. 


We know of no attempts to exploit parallelism
in programs based upon a specification of the
program in logic. We know of no other approaches
similar to ours using other computer
configurations.


The work described is of considerable
interest for problems which inherently have a high
degree of parallelism. These problems arise in
artificial intelligence and in database systems.
Sequential computations, although possible to run
in the system, would undoubtedly be executed much
slower in our approach.


Once the system is implemented, there is much 
that must be done. Its effectiveness must be 
evaluated. Simplifications made for this first 
system must be removed. A more flexible control 
capability should be provided to the user so that 
advantage can be taken of his knowledge of the 
problem. Modifications to PROLOG-like languages 
will be necessary to enable communications between 
machines to take place. Considerations must also 
be given to operating system problems and to 
understanding the optimum configuration needed to 
solve a problem. Work on large artificial intelli- 
gence problems and databases would be achieved by 
the availability of disks and drums on the Z80A 
machines. Thus a wealth of research topics remain 
to be accomplished. Work on parallelism and logic 
programming is in its beginning stages.


SYSTEM ARCHITECTURE OF A RECONFIGURABLE MULTIMICROPROCESSOR RESEARCH SYSTEM


Vito A. Trujillo
Los Alamos, New Mexico 87545


Summary 


This paper presents the architecture of an 
experimental multiprocessor system that incor- 
porates a reconfigurable array of microprocessing 
and memory elements. The system is designed 
specifically as a research tool for implementing 
and evaluating parallel-processing algorithms on 
various multiprocessor architectures. Consequent- 
ly, the principal design objective is to provide a 
multiprocessor with fully reconfigurable
processor-to-memory and processor-to-processor
interconnections in order to allow direct comparison
of algorithms for a wide range of multiprocessor 
architectures. Basically, the system is a tightly 
coupled, shared-memory MIMD machine [1-2] that 
supports reconfiguration between processor and
memory nodes to permit experimentation with common
memory architectures and with various processor 
network structures such as rings, trees, and 
stars. This experimental computer system is 
currently under development within the Computing 
Division at Los Alamos National Laboratory. 


As illustrated in Figure 1, the Multimi- 
croprocessor Research System consists of numerous 
processor and memory nodes that are directly in- 
terconnected using multiple processor-to-memory 
buses and multiported global memory nodes. The 
multiple bus/multiported memory arrangement func- 
tionally implements a full crossbar switch between 
the processor and memory nodes [3]. This 
multiple-bus architecture allows processor-to-
processor communications to occur concurrently
with processor execution from either local memory
or global memory.

Figure 1. Multiprocessor Research System system architecture.


Three types of processor nodes are included 
within the system: (1) system control processor, 
(2) general floating-point processors, and (3) 
dedicated data transfer processors. The system 
control processor performs system initialization 
(downloading of global memory, configuration con- 
trol, etc.), initiation of parallel processing ap- 
plications code, performance measurements, memory 
error processing, etc. In addition, because the 
multiprocessor is strictly an execution environ- 
ment, the system control processor provides com- 
munication with an external local area network 
that includes development workstations. Each gen- 
eral floating-point processor includes Intel iAPX 
86/87 microprocessing elements [4], 48k bytes of 
local dedicated ROM/RAM, real-time interrupt fa- 
cility, and memory mapping logic that allows 
sixty-one 16k-byte memory segments to be per- 
manently and/or dynamically allocated within the 
system global memory. Each data transfer proces- 
sor is a high-speed controller specifically 
designed for implementing processor-to-processor 
communications by performing data movement between 
global memory segments. 
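
The segment counts quoted above are consistent with the 1M-byte physical address space of the iAPX 86: 64 segments of 16k bytes, of which three correspond in size to the 48k bytes of local memory, leaving 61 for global memory. The sketch below works through that arithmetic and shows one way a segment map could translate a processor address into a global memory location. The map layout and the `translate` helper are hypothetical, not the design of the actual mapping logic.

```python
SEGMENT_BYTES = 16 * 1024       # 16k-byte mapping granule (from the text)
ADDRESS_SPACE = 1 << 20         # 1M-byte iAPX 86 physical address space
LOCAL_BYTES = 48 * 1024         # local ROM/RAM per processor (from the text)
NODE_BYTES = 256 * 1024         # RAM array per global memory node

TOTAL_SEGMENTS = ADDRESS_SPACE // SEGMENT_BYTES       # 64
LOCAL_SEGMENTS = LOCAL_BYTES // SEGMENT_BYTES         # 3
GLOBAL_SEGMENTS = TOTAL_SEGMENTS - LOCAL_SEGMENTS     # 61
SEGMENTS_PER_NODE = NODE_BYTES // SEGMENT_BYTES       # 16


def translate(address, segment_map):
    """Map a processor address to (memory node, offset within that node).

    segment_map is a hypothetical table: processor segment number ->
    (memory node number, segment number within that node).
    """
    segment, offset = divmod(address, SEGMENT_BYTES)
    if segment not in segment_map:
        raise LookupError(f"segment {segment} is not mapped to global memory")
    node, node_segment = segment_map[segment]
    return node, node_segment * SEGMENT_BYTES + offset
```

Changing an entry of such a table is what "dynamically allocated" would mean in this sketch: the same processor address then refers to a different global memory segment.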


The system global memory consists of multiple 
memory nodes, each having a 256k-byte RAM array 
accessible from the system control processor and a 
multiported memory controller. The port for the 
system control processor supports downloading and 
memory error reporting functions. The multiported 
memory controller includes interface logic for 20 
ports, memory arbitration logic that implements a 
last-granted-lowest-priority algorithm, and a
high-speed memory access controller. Memory
mapping logic within each processor node allows each
memory node segment to be allocated as either
private or public memory for each processor node.
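
One reading of a last-granted-lowest-priority rule is that the port just served moves to the back of the priority order, so no requesting port can be starved. The sketch below illustrates that reading only; the `Arbiter` class and its list-based priority order are assumptions, and the hardware may realize the rule differently.

```python
class Arbiter:
    """Hypothetical arbiter: the most recently granted port gets lowest priority."""

    def __init__(self, ports=20):
        self.order = list(range(ports))  # highest priority first

    def grant(self, requests):
        """Return the winning port among `requests`, or None if there are none."""
        for port in self.order:
            if port in requests:
                # Demote the winner to the end of the priority order.
                self.order.remove(port)
                self.order.append(port)
                return port
        return None
```

With two ports requesting continuously, grants alternate between them, which is the fairness property the rule is meant to provide.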


The processor-memory interconnection is ac- 
complished with memory mapping logic at each pro- 
cessor node, a multiported memory controller at 
each global memory node, and a multiple bus inter- 
connection backplane that allows an orthogonal ar- 
rangement of processor and memory boards. As il- 
lustrated in Figure 2, an orthogonal packaging 
scheme uses minimal bus lengths in providing com- 
plete physical interconnection between processor 
and memory nodes. Basically, the processor-memory 
interconnection provides fully reconfigurable 
processor-to-memory connections, resolves access 
arbitration when multiple processors are simul- 
taneously accessing a common global memory node, 
and supports mutual exclusion to shared memory. 
Control of shared memory is accomplished through 
an extension of the LOCK mechanism available with 
the iAPX 86/87 microprocessor [4]. 


Processor-to-processor communication is
implemented indirectly through the processor-memory
interconnection by specialized data transfer pro- 
cessors that perform data movement between global 
memory nodes. Each data transfer processor in- 
cludes memory mapping logic similar to the general 
floating-point processors; consequently, these 
nodes can access any segment within system global 
memory. In addition, the data transfer processor 
nodes include high-speed control, buffer, and 
translation logic that permit both contiguous and 
noncontiguous memory block transfers. The data 
transfer processors are controlled by linked 
structures within global memory and include
maskable interrupt capability indirectly through the
system-control processor. Functionally, the
data-transfer processors can be visualized as
multiple intelligent buses for interprocessor
communications.


Currently, the Multimicroprocessor Research 
System accommodates a single system control pro- 
cessor, 20 processor nodes that can include either 
general floating-point processors or data transfer 
processors, and 32 global memory nodes. However, 
a typical maximum configuration consists of 16 
general floating-point processors and 2 data 
transfer processors, which are sufficient for
handling 16 iAPX 86/87 interprocessor communications.


Abstract -- The design of a multimicroprocessor
system for image processing and pattern recognition
applications utilizing the 16-bit Motorola
MC68000 and other off-the-shelf components is
described. This system can be dynamically
reconfigured to operate in either SIMD or MIMD mode and
can be used as a building block for the PASM
partitionable SIMD/MIMD machine. The results of
simulations of SIMD operation that were used to
guide the design of the MC68000-based system are
discussed. The possibilities for overlapped
operation of the SIMD control unit and processors
are examined. The system architecture, including
hardware to interface the off-the-shelf components
needed for SIMD/MIMD processing, is given.
Finally, simulation studies of the performance of the
proposed MC68000-based system are presented.


I. Introduction 

The demand for higher throughput and very large
database handling capabilities is forcing computer
system designers to consider nontraditional
architectures, notably distributed/parallel systems.
Architects have proposed microprocessor-based
large-scale parallel processing systems with as
many as 2^14 and 2^16 processors [e.g., 9, 17] that
show promise in meeting these data-handling and
throughput demands.

Two types of parallel processing systems are
SIMD and MIMD [4]. SIMD (single instruction
stream - multiple data stream) machines (e.g.,
Illiac IV [3], STARAN [1]) typically consist of a
set of N processors, N memories, an interconnection
network, and a control unit. The control
unit broadcasts instructions to the processors,
and all enabled ("turned on") processors execute
the same instruction at the same time. Each
processor executes instructions using data from a
memory with which only it is associated. The
interconnection network allows interprocessor
communication.

An MIMD (multiple instruction stream - multiple
data stream) machine also typically consists of N
processors and N memories, but each processor can
follow an independent instruction stream (e.g.,
C.mmp [22], Cm* [18]). As with SIMD architectures,
there is a multiple data stream and an
interconnection network.

A Multiple-SIMD machine is a parallel process- 
ing system that can be structured as one or more 
independent SIMD machines of varying sizes (e.g., 
MPP [8]). A partitionable SIMD/MIMD machine can
be structured as one or more independent SIMD 
and/or MIMD machines of varying sizes (e.g., PASM 
[13]). 

SIMD and MIMD parallelism has been shown to be
applicable to a wide variety of image processing
tasks [13, 14, 15]. In this paper, the SIMD mode
is emphasized; however, the use of full processors
and the overall organization of the system will
also allow MIMD operation. The system to be
presented could be used as a single SIMD machine,
or as a building block for a multiple-SIMD
machine, or a partitionable SIMD/MIMD machine
(using the techniques described in [13]).

SIMD algorithm simulations for several
configurations have been performed [5, 11]. These
studies have examined the possibilities for
overlapped operation of the control unit and
processors. Overlapping can be improved as additional
hardware (e.g., latches, buffers) is added at the
interfaces of these components. Each hardware
configuration represents a "case" for which
relative run time performance of assembly language
test algorithms was measured. The results of
these simulations were used as a basis for the
control unit/processor organization described in
this paper. A design based on this organization
and employing currently available off-the-shelf
components is described and simulated.

This design work is motivated by two research 
projects at Purdue. One is the development and 
implementation of the PASM (partitionable 
SIMD/MIMD) multimicroprocessor system. The other 
is the study of the use of parallel processing for 
mapping applications. 

In Section II, a machine model of the proposed
SIMD/MIMD system is given. A summary of our
earlier SIMD algorithm simulation studies and
overlapping schemes is presented in Section III. In
Section IV, the design of a multimicroprocessor
system which incorporates Motorola MC68000
processors is described. The hardware organization of
the control unit, processors, and additional
support components is discussed in detail in Section
V. It is shown that the interface logic for the
microprocessors necessary for SIMD/MIMD processing
will be minimal; thus the high cost of a custom
VLSI design can be saved. Ideas for a prototype
patterned on this design are given. Finally,
results of simulation studies of the proposed
MC68000-based machine are summarized in Section
VI.


The basic system components of the proposed
machine are a Control Unit (CU) (including its own
memory), N = 2^n processors, N memory modules, and an
interconnection network. The processors are
microprocessors that perform the actual SIMD and
MIMD computations. A memory module is connected
to each processor to form a processor/memory pair,
called a processing element (PE). The PEs are
numbered (addressed) from 0 to N-1. The
interconnection network provides a means of communication
among the PEs.

In SIMD mode, the CU fetches instructions from 
its memory, executes the control flow instructions 
(e.g., branches), and broadcasts the data process- 
ing instructions to the PEs. The CU may coordi- 
nate the activities of the PEs in MIMD mode. 

"Functional-block" models of the interactions 
of the CU, PEs, and network will now be presented. 
Later, the hardware used to implement each func- 
tion will be described. 


The CU's functions may be classified into six
areas (consult Figure 1). The numbers in
parentheses in Figure 1 correspond to the
component classifications given below.

(1) The CU execution unit performs program flow
    operations (e.g., loop counting, branching).
(2) CU memory contains the SIMD instruction
    stream. It also provides data storage for
    the CU execution unit.
(3) The fetch unit fetches instructions from CU
    memory and routes them to the CU execution
    unit, the PEs (via the CU/PE interface), or
    to other specialized CU hardware.
(4) The CU/PE interface collects PE instructions
    and enable signals and broadcasts these to
    the PEs.
(5) The masking operations unit decodes and
    manipulates masks. Masks specify which PEs are
    to be enabled or disabled.
(6) Microprogrammed logic directs the operations
    of the fetch unit, masking operations unit,
    and other specialized CU hardware. Signals
    are generated for system control functions
    (e.g., "bringing up" the CU and PE execution
    units, initializing I/O devices).


A PE's functions include the following (consult
Figure 2).

(7) In SIMD mode, the PE execution unit accepts
    instructions broadcast by the CU and performs
    computations that process the local (PE
    memory) data stream. In MIMD mode,
    instructions and data are fetched from PE memory.
(8) PE memory contains data for the SIMD mode
    operations of the PE execution unit. It also
    contains instructions and data for MIMD mode
    operations.
(9) The PE/network interface sends data and
    routing information to and accepts data from the
    interconnection network.
(10) The condition codes register stores the PE
    execution unit condition codes. The data
    condition select lines specify which bit or
    boolean function of bits in the register will
    represent the status of the PE.
(11) Logic controlled by the PE's enable/disable
    signal ensures that the PE executes no
    instructions and generates no network conflicts
    while disabled.


The interconnection network has the single task
of transferring data among the PEs. It accepts
data from the "source" PEs at its N input ports
and routes the data to its N output ports, where
it is accessible to the "destination" PEs. The
Generalized Cube network, a network being
considered for use in PASM for reasons discussed in
[12], is assumed in the simulations. This network
consists of n stages of switches and is controlled
by routing tags.
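
The paper does not spell out the tag format, so the following is only a sketch of one common scheme for cube-type multistage networks, assumed here for illustration: the tag is the bitwise exclusive-or of the source and destination PE addresses, and the switch at stage i exchanges its inputs when bit i of the tag is 1. The function name and the stage ordering (n-1 down to 0) are likewise assumptions.

```python
def route(source, destination, n):
    """Trace a message through an n-stage cube-type network.

    Returns the line number occupied after each stage, starting with the
    source line. Assumes an exclusive-or routing tag (illustrative only).
    """
    tag = source ^ destination
    line = source
    path = [line]
    for stage in range(n - 1, -1, -1):
        if (tag >> stage) & 1:
            line ^= 1 << stage   # "exchange": move to the line differing in bit `stage`
        path.append(line)        # otherwise "straight": stay on the same line
    return path
```

For example, with n = 3, a message from PE 5 to PE 3 has tag 6 and is exchanged at stages 2 and 1, so `route(5, 3, 3)` ends on line 3.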

In a serial processor, components 4, 5, 6, 9,
10, and 11 are either unnecessary or meaningless.
These functions comprise what is known as
the "overhead due to parallelism." Well-designed
CU/PE and PE/network interfaces can minimize this
overhead by overlapping the operations of the CU,
PEs, and network. Overlapping allows the CU, the
set of PEs, and the network to perform their own
tasks, synchronizing only when there is some
information to be exchanged. Examples of
overlapping are:

(1) the CU fetching the next instruction in the
    stream or executing CU instructions while the
    PEs are executing an instruction,
(2) the PEs executing an instruction while a set
    of data items is passing through the network,
    and
(3) the network passing more than one set of data
    items from input to output simultaneously.

In this paper, (1) is analyzed and simulated.
(Aspects of (2) and (3) are discussed in [11], but
are beyond the scope of this paper.)

In SIMD mode, all of the enabled PEs will
execute instructions broadcast to them by the CU. A
masking scheme is a method for determining which
PEs will be active at a given point in time. An
SIMD machine may have several different masking
schemes.


The general masking scheme uses an N-bit PE
enable vector to determine which PEs to activate.
PE i will be active if the i-th bit of the PE
enable vector is a 1, for 0 <= i < N. A mask instruction
is executed whenever a change in the active status
of the PEs is required. The Illiac IV, which has
64 processors and 64-bit words, uses general masks
[16]. However, when N is larger, say 1024, a
scheme such as this becomes less appealing.
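
A general mask can be modelled as an integer whose bit i is the enable bit of PE i. The sketch below is illustrative only: the function names are hypothetical, and a Python callable stands in for a broadcast instruction.

```python
def enabled_pes(enable_vector, n_pes):
    """Return the indices of the PEs whose enable bit is 1."""
    return [i for i in range(n_pes) if (enable_vector >> i) & 1]


def broadcast(instruction, pe_data, enable_vector):
    """Apply one instruction to the local data of every enabled PE."""
    for i in enabled_pes(enable_vector, len(pe_data)):
        pe_data[i] = instruction(pe_data[i])
    return pe_data
```

With N = 64 the whole vector fits in one Illiac IV word, whereas N = 1024 would need a 1024-bit mask for every change of status, which is the cost the text refers to.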


The PE address masking scheme [10] uses a
2n-bit mask to specify which of the N PEs are to be
activated. PE address masks are fetched from the
instruction stream and sent to the masking
operations unit to be decoded into a PE enable vector
[13]. This vector is passed to the CU/PE
interface to effect the change in status of the PEs.
General masks are passed to the CU/PE interface
unchanged by the masking operations unit.
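
The encoding of the 2n bits is not given here. A natural reading, assumed for this sketch, is two bits per address position, enough to encode "0", "1", or "don't care" (written X below) for each of the n address bits. The function name and the string representation are hypothetical.

```python
def decode_address_mask(mask):
    """Decode an n-position mask over {'0', '1', 'X'} into an N-bit enable vector.

    mask[0] constrains the most significant PE address bit. A PE is
    enabled when every non-X position equals the matching address bit.
    """
    n = len(mask)
    vector = 0
    for pe in range(1 << n):
        address_bits = format(pe, f"0{n}b")
        if all(m == "X" or m == b for m, b in zip(mask, address_bits)):
            vector |= 1 << pe
    return vector


even_pes = decode_address_mask("XX0")   # PEs 0, 2, 4, 6 of 8
odd_pes = decode_address_mask("XX1")    # PEs 1, 3, 5, 7 of 8
everyone = even_pes | odd_pes           # decoded vectors combine with boolean OR
```

Under this reading a mask for N = 1024 PEs needs only 20 bits, at the price of being unable to name arbitrary subsets; combining decoded vectors with boolean operations recovers some of that flexibility.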

PE address masks may be decoded and then
manipulated by the masking operations unit. For
example, decoding two PE address masks, "or"-ing them
together, and using the result as the PE enable
vector activates the union of the sets of PEs
activated by each individual mask [13]. This
implies that the masking operations unit can perform
basic boolean operations on masks and can
temporarily store a number of general and decoded PE
address masks.

Data conditional masks are the result of
performing a test on local (PE) data in an SIMD
machine environment, where the results of
different PEs' evaluations may differ. As shown in
Figure 1, the CU receives an N-bit data
conditional mask comprised of N one-bit "true/false" data
conditional results, one result from each PE's
condition code register. The "true/false" data
conditional results are stored in the masking
operations unit for use in activating or
deactivating the PEs. For example, this type of data
conditional masking was used in PEPE to implement
the "where" conditional tests [21].

Certain CU execution unit instructions cause a
branch based on data conditional mask information.
For example, "if any" PE meets some criteria (a
bit in the data conditional mask is "true"), the
CU execution unit would execute a branch to a
different part of the program. The masking
operations unit uses the data conditional mask results
from the PEs to evaluate the "if any," "if all,"
etc., conditions.
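
The "if any" and "if all" conditions reduce to simple bit tests on the data conditional mask. The sketch below is a minimal illustration; restricting both tests to the currently enabled PEs is an assumption, as are the function names.

```python
def if_any(conditional_mask, enable_vector):
    """True when at least one enabled PE reported a "true" result."""
    return (conditional_mask & enable_vector) != 0


def if_all(conditional_mask, enable_vector):
    """True when every enabled PE reported a "true" result."""
    return (conditional_mask & enable_vector) == enable_vector


def where(conditional_mask, enable_vector):
    """Enable vector for a "where" body: enabled PEs whose test was true."""
    return conditional_mask & enable_vector
```

The CU would branch on the result of `if_any` or `if_all`, while the vector returned by `where` would be handed to the CU/PE interface as the new PE enable vector.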


III. SIMD Simulation Overview 


A. Introduction 

Our earlier SIMD algorithm simulation studies
have examined the possibilities for overlapped
operation of the control unit, processors, and
interconnection network [5, 11]. Overlapping can be
improved as additional hardware (e.g., latches,
buffers) is added to the CU/PE and PE/network
interfaces. Six hardware configuration "cases" were
identified and the relative run time performance
of assembly language test algorithms was measured
for each. A summary of the four cases from [5]
and the two cases using tagged instruction words
from [11] appears in Subsection B. In Subsection
C, simulation results from the six cases will be
summarized and compared. Based on the results,
one of the cases will be chosen for the
MC68000-based design.


B. Summary of Cases 

In case 1, the CU and PEs are forced to operate 
in lock-step fashion. That is, while the CU is
fetching or executing an instruction, the PEs are 
idled, and vice-versa. The STARAN system operates 
in a case 1 mode since there is a single instruc- 
tion register in the control unit which contains 
the currently executing instruction [19]. 

Case 2 allows the CU to fetch instructions or 
execute CU instructions while the PEs are execut- 
ing. However, the CU must wait until the PEs have 
completed their operation before broadcasting the 
next PE instruction. 

A FIFO instruction queue shared by the PEs is
added in case 3. This allows the CU to send PE
instructions (opcodes and operands) to the queue
without having to wait for the PEs to complete
their current instruction. The N-bit
enabled/disabled status associated with each
instruction (the PE enable vector at fetch time) is
stored in the queue with its opcode/operand pair.
The PE enable vector must be stored since
CU masking operations (changing the PE
enabled/disabled status) might be performed before
the queued PE instruction is actually executed.
The Illiac IV and MPP control units and PEPE
arithmetic control unit use the case 3 overlap
processing method [2, 19, 21]. All three machines
employ data conditional masking, but the resulting
masks are stored in the PEs themselves.
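
The reason for storing the enable vector with each queued instruction can be seen in a small sketch. The class below is hypothetical and models only the queue, not the timing: each entry carries a snapshot of the PE enable vector taken when the instruction was fetched.

```python
from collections import deque


class PEInstructionQueue:
    """Hypothetical case 3 queue: entries are (opcode, operands, enable vector)."""

    def __init__(self):
        self._entries = deque()

    def enqueue(self, opcode, operands, enable_vector):
        # Snapshot the enable vector in force at fetch time.
        self._entries.append((opcode, tuple(operands), enable_vector))

    def dequeue(self):
        """Return the oldest entry, or None when the queue is empty."""
        return self._entries.popleft() if self._entries else None


queue = PEInstructionQueue()
current_mask = 0b1111
queue.enqueue("ADD", [1, 2], current_mask)
current_mask = 0b0011                 # CU performs a masking operation
queue.enqueue("SUB", [3], current_mask)
first = queue.dequeue()               # still carries the old mask 0b1111
```

Without the snapshot, the queued ADD would run under the mask intended for the later SUB.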

For case 4, a CU instruction buffer is added to
the case 3 configuration. An implicit assumption
for this case is that the fetch unit and CU
execution unit are independent processors. (For cases
1-3, the fetch and execution units are not
necessarily distinct.) The fetch unit classifies
instructions as CU or PE instructions and sends them
to the appropriate instruction buffer. The fetch
unit must distinguish branch operations (including
"if any/if all" branches) by stopping the fetching
process when these instructions are encountered.
Branch instructions affect the program counter
(which the fetch unit maintains to know the "next"
instruction), so the fetch unit must wait until
the CU has emptied its instruction buffer and
adjusted the program counter based on the result of
the branch before continuing. Fetching is also
discontinued during masking operations to allow
the masking operations unit to associate the new
PE enable signals with subsequently fetched PE
instructions.

For the previous cases, the fetch unit decoded
each instruction in order to determine where (the
CU or the PEs) the instruction would be executed
and the size (number of operands) of the
instruction. This scheme required that the fetch unit
have full knowledge of CU and PE instruction types
and formats. This adds considerable complexity to
the fetch unit, and would necessitate changes to
it if either the CU or PE execution units were
changed. An effective solution is to associate a
tag with each CU memory word, specifying which
component (CU execution unit, CU microprogrammed
logic, or PEs) is to interpret the word. Cases 5
and 6 correspond to cases 3 and 4, but with
fetching and buffering by words rather than by
instructions. Each word (as opposed to each instruction,
including both opcode and operands, as in cases
1-4) sent to the PE instruction buffer will have
associated with it an N-bit enable vector.
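
A tag-driven fetch unit needs no knowledge of instruction formats; it only inspects the tag. The sketch below illustrates that routing. The tag values and the buffer layout are hypothetical, and words are treated as opaque integers.

```python
# Hypothetical tag values naming the component that interprets a word.
CU_EXEC, CU_MICRO, PE = "cu_exec", "cu_micro", "pe"


def dispatch(tagged_words, enable_vector):
    """Route each (tag, word) pair to the buffer of its destination.

    Words bound for the PEs are stored together with the current enable
    vector; the fetch unit never decodes the word itself.
    """
    buffers = {CU_EXEC: [], CU_MICRO: [], PE: []}
    for tag, word in tagged_words:
        if tag == PE:
            buffers[PE].append((word, enable_vector))
        else:
            buffers[tag].append(word)
    return buffers
```

A multiword PE instruction simply appears as several consecutive words tagged for the PEs, each queued with the same enable vector.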

The tag scheme just described has several
advantages. First, the PEs will only be idled when
the instruction queue becomes empty, an unlikely
event since the instructions are delivered to the
queue at the rate of the CU memory access time.
The time needed to decode the tag is negligible in
comparison to the time necessary to fully decode
each instruction and determine how many operand
words are associated with that instruction. The
fetch unit no longer requires knowledge of the
specifics of the PE instruction set since PE
instruction words are treated as data. Furthermore,
a 16-bit line connecting the CU to the processors
is needed, as opposed to the 80-bit line if
complete instructions, including operands, were sent
to the PEs (assuming MC68000 instructions require
1 to 5 16-bit words). However, the instruction
opcode must be decoded by the PE execution units
to determine if "immediate" data operands or
address fields are present. This step was
previously done by the control unit; data operands were
associated with the instruction opcode before
being broadcast to the PEs.


C. Simulation Results


During simulations performed for several assem- 
bly language test algorithms, the relative run 
time performance of the 6 cases was measured. The 
results of cases 1-4 were presented in [5] but are
summarized here for comparison with cases 5 and 6. 

The assembly language instruction set that was
used for the simulations is of our own design. It 
is similar to instruction sets supported by so- 
phisticated current microprocessors, but augmented 
by instructions for the control unit operations, 
masking operations, and network data transfers. 

The test algorithms are two versions of an
image smoothing algorithm for a 16-PE system
smoothing a 16x16 pixel image [13]. For these
algorithms, each PE contains a subimage of 4x4 pixels.
In the "original" version of the algorithm, a PE's
subimage pixels and "border" pixels from adjacent
PEs are copied to a 6x6 pixel "work area" array.
Smoothing operations are performed on the pixels
in the work area. For the "improved" version of
the algorithm, the "border" pixels and a subset of
the subimage pixels are copied to the work area.
In this version, both the work area array and the
subimage array are accessed during the smoothing
operations. As will be shown, the original
algorithm performs better for small images, while the
improved algorithm performs better for large (more
realistically sized) images. Some parameters of
the algorithms are shown in Table 1.


Table 1. Test algorithm characteristics. The
"TOTAL CU" and "TOTAL PE" columns indicate the
percentages of CU and PE instructions executed.
"CU IF ANY" is a subclass of "CU BRANCH," which is
a subclass of "TOTAL CU." Similarly, "PE NETWORK"
instructions are included in the "TOTAL PE"
classification.

Columns: ALGORITHM; INSTRUCTIONS EXECUTED; PE/CU
INSTRUCTION RATIO; INSTRUCTIONS EXECUTED (PERCENT):
TOTAL CU, CU BRANCH, CU IF ANY, TOTAL PE, PE NETWORK.

ORIGINAL | [entries illegible in source]
IMPROVED | 680 | 4.2 | 19 | [remaining entries illegible in source]


Each test algorithm was assembled using a
special assembler supporting the augmented
instruction set and simulated using our Purdue SIMD
Simulation and Timing (PSST) system. An instruction
execution trace for each simulated algorithm was
generated to be used later as input to the timing
algorithms. A small number of PEs and small image
sizes were used since the simulations of the SIMD
system are performed comparatively slowly on a
serial host computer. Details of the algorithms,
instruction set, assembler, simulations, and
timing routines are presented in [11].

In preparation for timing the simulations, each 
instruction in the instruction set was classified 
by its constituent operations and characteristics. 
These characteristics include the number of 
operand words to be fetched, the CU execution time 
(for CU instructions), the CU to PE transfer time 
(for PE instructions), the PE execution time and 
network execution time (for PE instructions), 
flags to indicate data conditional mask instruc- 
tions, branch instructions, network instructions, 
and so on. A table of instructions and their 
characteristics was prepared. 

The timing algorithms reference the instruction set characterization
table and accept input of relevant timing information (e.g., opcode
load time (1 cycle), 16-bit operand load time (1 cycle), buffer
enqueue or dequeue time (1/2 cycle), mask decoding time (1 cycle)).
The interconnection network set-up time and network propagation
delay time were 1 cycle each. Finally, the instruction trace output
from the test algorithms was used as input to evaluate the timing
for cases 1-6. Note that the same instruction execution trace can be
used repeatedly for many combinations of cases and timing
assumptions. For these simulations, a circuit-switched network whose
ports are directly connected to PE execution unit registers was
assumed. No PE/network overlap was considered. 

The run time results shown in Table 2 are normalized such that the
case 1 timing = 1.00. As shown, the run time of case 3 is
significantly less than those of case 2 and case 1. This was
expected since the instruction "mix" for these algorithms is such
that PE instructions greatly outnumber CU instructions and PE
instructions occur in large groups, allowing the buffer to perform
its intended function. The case 4 run time falls somewhere between
the case 2 and case 3 timing. The fact that case 4 performs worse
than case 3 for these algorithms is not surprising since CU
instructions rarely occur in groups (thus underutilizing the CU
instruction buffer). Further, branch instructions account for 70 to
80 percent of the CU instructions, thus preventing the filling of
the CU instruction buffer in the case 4 configuration. 

In cases 5 and 6 (fetching and buffering by words), the time needed
to fetch and enqueue instructions, including their operands, is
proportional to their length (cases 1-4 had a constant time).
However, the simplified tag decoding for these cases might offset
the overhead of the extra enqueue/dequeue operations. Comparisons
made between cases 1-4 and 5-6 may be influenced 


Table 2. Normalized run times for cases 1-6. 


            CASE 1   CASE 2   CASE 3   CASE 4   CASE 5   CASE 6
ORIGINAL    1.00     [remaining entries not reliably legible in source]
IMPROVED    [entries illegible in source]


(a) "Enqueue" and "dequeue" operations may not overlap each other. 
(b) "Enqueue" and "dequeue" operations may overlap each other. 


strongly by the simpler (and potentially faster) case 5-6 hardware.
For example, enqueue and dequeue times may be shorter for cases 5
and 6 since all queuing functions involve a shorter, fixed-size
word. The very wide bus assumed in cases 1-4 may in reality be a
smaller, time-multiplexed bus, thus increasing the CU/PE instruction
transfer time for those cases. If the fetch, decode, enqueue,
dequeue, and execution times are assumed to be the same as for cases
1-4, cases 5 and 6 perform somewhat worse than the case 2
configuration because of the aforementioned factors. Case 3 is
faster than case 5 because enqueuing and dequeuing operations are
not done word-by-word. The speed advantage of case 3 would be
negated if it used a 16-bit time-multiplexed bus and a slightly
slower fetch/decode unit. The percentage of instructions with
operands and the average operand length (both algorithm-dependent
parameters) also influence the relative performance of the cases
greatly. 

The instruction queue sizes chosen for the case 3-6 configurations
also have an effect on the algorithms' run time. The minimum size
needed was seven words for case 3, six for case 4, and three for
cases 5 and 6. A detailed analysis of the minimum PE instruction
buffer sizes required to get the same overall execution time that
the infinite buffer (assumed in Table 2) would provide is given in
[11]. 

Based on the simulation results obtained for the SIMD mode, the case
5 configuration has been chosen. Case 3 was not chosen because of
the more complex fetch unit design and the very wide CU/PE bus width
requirement. Assuming the use of standard microprocessors, the case
3 configuration unnecessarily duplicates the instruction decoding
function of the PEs. A narrower, time-multiplexed CU/PE bus could be
implemented with case 3, but this approach would likely negate the
speed advantage gained by buffering instructions as a unit.
Furthermore, standard microprocessors accept instructions
word-by-word. Case 5 simplifies the design of the fetch unit
considerably since tags associated with each memory word indicate
that word's destination. The fetch unit requires no knowledge of
either the CU or PE execution unit's processor instruction set. The
tagged memory scheme also allows the instruction complement of the
microprogrammed hardware to be developed independently. The PE
instruction queue and bus width of 16 bits is quite manageable. The
case 5 queue may be longer since it is word-by-word, but has a much
narrower width that is always fully utilized. Simulation results of
the MC68000-based system are presented in a later section. 

IV. The MC68000-Based PE 

Referring to the model of a PE (Figure 2), consider incorporating
the Motorola MC68000 processor as the PE execution unit. The
processor itself, 256K-bytes of PE memory, and some simple latches
(PE/network interface, condition code register) and logic can easily
fit on a single physical board. The organization of the model was
chosen carefully so that the number of wires running between the CU
and PEs is minimized. The consolidation of specialized hardware in
the CU makes each PE board simpler and cheaper to construct. 

The MC68000 is a state-of-the-art 16-bit microprocessor [20, 7].
Internally, it can operate on bit, byte, word (16-bit), and long
(32-bit) data formats. Its fast cycle time and large address space
(currently 24-bit addresses) make it ideal for image processing
applications where speed and large data set handling capabilities
are a must. Its very regular instruction set, many addressing modes,
and suitability to high-level language operations make it easy to
program. While some of the MC68000's functions go unused when it
operates in SIMD mode (e.g., branch and control operations), these
functions are essential for MIMD "stand-alone" processing. While the
MC68000 is not quite as "powerful" as the Illiac IV [3] or PEPE [21]
PE, it is considerably more complex than the STARAN [1] or MPP [2]
processors. 

Each PE will be able to address any of three logical address spaces.
Physical PE memory addresses (both ROM and RAM addresses) will
represent one space. Addresses of I/O ports will be contained in the
second space. The PE instruction queue (for the case 5
configuration) will have addresses in the third space. Initially,
all PEs will be enabled, and have their internal program counter set
to the address of the beginning of the PE instruction queue space.
When the PEs try to fetch the first SIMD instruction, the address
sent out by each of the PE execution units will be decoded by the
"address decoding logic" as a reference to the PE instruction queue
space. This logic will send an "instruction request signal" to the
FIFO instruction queue. When all PEs request an instruction, the
buffer acknowledges the requests and puts an instruction word on all
the PE data buses. Each PE decodes the instruction and performs the
operation or requests additional operand words. If the logic
determines that a PE memory or I/O device address is being
referenced, the operation is performed normally. 
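
The three-way address decoding described above can be sketched in a few lines. This is only an illustration: the address ranges and the names `Space`, `decode`, and `requests_instruction` are hypothetical, since the text gives no address map. Only the classification into PE memory, I/O ports, and the PE instruction queue space comes from the description.

```python
from enum import Enum


class Space(Enum):
    MEMORY = "PE memory (ROM and RAM)"
    IO = "I/O ports"
    QUEUE = "PE instruction queue"


# Hypothetical address map inside the 24-bit MC68000 address space.
MEMORY_END = 0x040000                    # 256K bytes of PE memory
IO_BASE, IO_END = 0x800000, 0x800100     # assumed I/O port window
QUEUE_BASE, QUEUE_END = 0xC00000, 0x1000000  # assumed (large) queue space


def decode(address: int) -> Space:
    """Classify an address sent out by the PE execution unit."""
    if 0 <= address < MEMORY_END:
        return Space.MEMORY
    if IO_BASE <= address < IO_END:
        return Space.IO
    if QUEUE_BASE <= address < QUEUE_END:
        return Space.QUEUE
    raise ValueError(f"unmapped address {address:#x}")


def requests_instruction(address: int) -> bool:
    """True when the access should raise the instruction request signal."""
    return decode(address) is Space.QUEUE
```

In this sketch, a fetch from any address in the assumed queue window is treated as a request for the next queued instruction word, while memory and I/O accesses would proceed normally.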

In SIMD mode, the PE program counter serves 
only to identify a request for an instruction 
word. The actual value of the PE program counter 
is irrelevant, as long as it references an in- 
struction in the PE instruction queue space. How- 
ever, the program counter is incremented automati- 
cally upon receiving an instruction from the PE 
instruction queue. Eventually, the program 
counter will near the end of the instruction queue 
space and will need to be reset. The instruction 
queue address space is made large so that the 
overhead of resetting the program counter is 
minimal. 

When the PE enable vector specifies that a PE 
is to be disabled, the address decoding logic in 
that PE continues to send an instruction request 
signal to the PE instruction queue. However, the 
acknowledgement and data word from the queue are 
intercepted by the logic so that the PE execution 
unit never "sees" the instruction. When the PE 
execution unit is re-enabled, processing can con- 
tinue. 

In order to avoid internal modifications to the PE execution unit,
PEs will communicate via the interconnection network using a
sequence of I/O port read and write operations. A PE specifies where
its data is to be routed by computing the address of the destination
processor (PEs are addressed 0 to N-1). The address is written to an
external (n+1)-bit "network set" latch (the "extra" bit will be
described later). This action instructs the network to set switches
to make a connection with the destination address [6]. Data
transmissions will occur through two 16-bit external data latches
called Data Transfer Registers (DTRs) [13]. One latch is connected
to the network input (DTRin), and the other to the network output
(DTRout). The data to be transmitted is written to the DTRin latch.
Finally, a control word is written to an external 1-bit "network
transfer" register, signaling to the network that the transfer
should be made. Subsequent transfers route items to the same
destination until the "network set" latch is modified. In SIMD mode,
all PEs do these operations at the same time. In MIMD mode, PEs use
the network asynchronously. 
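
The send sequence above (load the "network set" latch, load DTRin, then write the "network transfer" register) can be modeled as follows. This is a minimal sketch for illustration only: the class `NetworkPort` and its method names are hypothetical, and the network itself is reduced to a list that records what was handed to it.

```python
class NetworkPort:
    """Hypothetical model of one PE's external network latches."""

    def __init__(self, n_bits: int):
        self.n_bits = n_bits      # PEs are addressed 0 to N-1, with N = 2**n_bits
        self.network_set = None   # destination held in the "network set" latch
        self.dtr_in = None        # 16-bit latch feeding the network input
        self.sent = []            # (destination, data) pairs given to the network

    def set_destination(self, dest: int) -> None:
        """Write the destination PE address to the "network set" latch."""
        if not 0 <= dest < (1 << self.n_bits):
            raise ValueError("destination out of range")
        self.network_set = dest

    def write_dtr_in(self, word: int) -> None:
        """Write a 16-bit data word to the DTRin latch."""
        self.dtr_in = word & 0xFFFF

    def transfer(self) -> None:
        """Write to the 1-bit "network transfer" register."""
        if self.network_set is None or self.dtr_in is None:
            raise RuntimeError("latches not loaded")
        self.sent.append((self.network_set, self.dtr_in))


port = NetworkPort(n_bits=4)      # a 16-PE system
port.set_destination(5)
port.write_dtr_in(0x1234)
port.transfer()
port.write_dtr_in(0x5678)         # destination latch is left unchanged
port.transfer()
```

The second transfer reuses the destination already held in the latch, which mirrors the statement that subsequent transfers go to the same destination until the latch is modified.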

At the destination PE, the network sets a flag indicating that the
DTRout contains newly-transferred data and may be read. When the PE
attempts to read DTRout, specialized logic examines this flag. If
the PE attempts to read DTRout prematurely (the flag is not yet
set), the PE is made to wait until the network has passed the data.
For this reason, other processing is often done during network
transfers to maximize overlapped operation of the PEs and network.
In MIMD mode, a PE might send data faster than the destination PE
requires it as input. In this situation, the network-generated
signal flag might be used to interrupt the receiving PE and instruct
it to buffer the incoming data. 

When a PE is disabled, logic ensures that the "network set" data
does not create "conflicts" in the network switch settings. The
"extra" bit in the "network set" latch is used to indicate that this
network input should be ignored. A disabled PE may receive data
normally since the DTRout is unaffected by the enabled/disabled
status of the PE execution unit. When re-enabled, the PE can read
DTRout. 

When a data conditional mask is needed, PE instructions to evaluate
the PE data condition are executed. Then the PEs write their status
register (which contains the processor condition codes) to the
condition codes register. Logic associated with the condition codes
register can generate eight different conditional tests (e.g.,
equal, not equal, positive, overflow). Data condition select lines
from the CU specify which of the conditional tests is to be returned
to the masking operations unit as that PE's component of the data
conditional mask. 

From time to time, the CU fetch unit will enqueue a JUMP instruction
to the beginning of the PE instruction queue space. This is to
prevent the PE program counters from entering a different address
space. The mechanism that the fetch unit uses to perform this
function will be described later. When the machine is to change from
SIMD to MIMD mode, the fetch unit broadcasts a JUMP instruction to
some address within the PE memory space. Typically, this would be
the beginning of a program stored in ROM that would initialize the
PE operating system for MIMD processing. While in MIMD mode, PEs do
not access the PE instruction queue space since MIMD instructions
and data are contained entirely within the PE memory. When the PE is
ready to revert to SIMD mode, it jumps to the beginning of the PE
instruction queue space. When all the PEs have done this, SIMD
processing continues. 

V. CU Architecture Details 


There exist no off-the-shelf processors that can perform all of the
functions of the control unit at a speed sufficient to keep the PE
execution units busy. Therefore, fast microprogrammable bit-slice
components will be used for all of the CU specialized functions.
These functions include the operations of the fetch unit, masking
operations unit, and CU/PE interface. In order to simplify the
programming of the system and to make data formats uniform
throughout, the CU execution unit will also be an MC68000 processor.
For comparison, the execution component of the Illiac control unit
is a powerful 64-bit integer/floating point processor [3]. The MPP
"main control" and the PEPE arithmetic control unit are also quite
sophisticated [21, 2]. By contrast, the STARAN execution unit and
MPP "PE control" unit consist of only a few dedicated registers for
loop counting and handling of "associative array field pointers"
[19, 2]. 

The speed at which the bit-slice fetch unit can fill the PE
instruction queue to capacity, and the large ratio of PE to CU
instructions in algorithms programmed so far, indicate that the
MC68000 will be an acceptable CU execution unit. When the subset of
MC68000 instructions actually used in normal CU execution unit
operations is defined through actual use and further simulation, and
if there is a need for more speed, the MC68000 could be replaced
with a bit-slice machine. 

CU memory will be comprised of 20-bit words. The most significant
four bits will be interpreted by the microprogrammed logic portion
of the fetch unit as a destination for the remaining 16. 
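
The split of a 20-bit CU memory word into a 4-bit tag and a 16-bit payload can be written out directly. The sketch below is illustrative only; the function name `split_word` is an assumption, and the meaning of individual tag values is not given at this point in the text, so none is assigned here.

```python
from typing import Tuple

TAG_BITS = 4    # most significant four bits: destination tag
DATA_BITS = 16  # remaining sixteen bits: instruction or operand word


def split_word(word: int) -> Tuple[int, int]:
    """Return (tag, data) for a 20-bit CU memory word."""
    if not 0 <= word < (1 << (TAG_BITS + DATA_BITS)):
        raise ValueError("not a 20-bit word")
    tag = word >> DATA_BITS
    data = word & ((1 << DATA_BITS) - 1)
    return tag, data
```

For instance, the word `0xA1234` separates into tag `0xA` and data `0x1234`.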

The CU fetch unit will contain two registers: the Fetch Unit Program
Counter (FUPC) and the PE Space Counter (PESC). The FUPC gives the
address of the next instruction to be fetched from CU memory. The CU
execution unit program counter serves only to identify a request for
an instruction word. The actual value of the CU execution unit
program counter is irrelevant, except when branch instructions are
executed. The FUPC and the CU execution unit program counter must be
equal before a branch instruction is executed since computations
using the program counter will be done (e.g., relative branches). 

The PESC begins at zero and is incremented each 
time a word is enqueued in the PE instruction 
queue. When the PESC reaches a threshold value 
close to the size of the PE instruction queue 
space, the fetch unit enqueues a JUMP instruction 
before the next PE instruction. (The first word 
of a PE instruction has a special tag.) The JUMP 
instruction causes the PE program counter to be 
reset to the beginning of the PE instruction queue 
space (see Section IV). When the JUMP instruction 
is enqueued, all PEs are temporarily enabled. The 
PESC register is also reset to zero. 
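
The PESC bookkeeping can be sketched as below. This is an illustration under stated assumptions: the class name `FetchUnitCounter`, the threshold value, and the string `"JUMP"` standing in for the real JUMP encoding are all hypothetical, and the enabling of all PEs while the JUMP is enqueued is not modeled.

```python
class FetchUnitCounter:
    """Hypothetical model of the PE Space Counter (PESC) logic."""

    def __init__(self, threshold: int):
        self.threshold = threshold  # assumed value near the queue space size
        self.pesc = 0               # words enqueued since the last reset
        self.queue = []             # words handed to the PE instruction queue

    def enqueue_instruction(self, words) -> None:
        """Enqueue one PE instruction given as a list of 16-bit words."""
        if self.pesc >= self.threshold:
            # Placeholder for a JUMP to the start of the queue space,
            # inserted before the first word of the next PE instruction.
            self.queue.append("JUMP")
            self.pesc = 0
        for word in words:
            self.queue.append(word)
            self.pesc += 1


fu = FetchUnitCounter(threshold=4)
fu.enqueue_instruction([1, 2])
fu.enqueue_instruction([3, 4])
fu.enqueue_instruction([5])
```

The check happens only at instruction boundaries, so a JUMP is never placed between an opcode and its operand words.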

The 4-bit memory word tags will specify what sequence of actions the
microprogrammed logic is to take. Examples of these actions are
enqueuing a PE instruction opcode or operand, sending a CU
instruction to the CU execution unit, mask decoding, and-ing and
or-ing of masks, PE data condition selection, initialization of the
CU execution unit, masking operations unit, PEs, or I/O devices,
etc. 

The CU fetch unit never operates at the same 
time the CU execution unit is performing branch 
instructions or while the masking operations unit 
is operating. The CU execution unit may modify 
the program counter which the fetch unit maintains 
to know the "next" instruction. The masking 
operations unit may modify the PE enable vector 
which must be associated with each enqueued PE 
instruction. 

The masking operations unit maintains a stack 
of N-bit masks generated by nested "where" condi- 
tionals and PE address masks [11]. The PE enable 
vector that is currently on the top of the stack 
is enqueued whenever a PE instruction word is en- 
queued. The details of the stack operations, 
stack hardware, and the interplay between SIMD 
programs and masks are discussed in [11]. 

The PE instruction queue (CU/PE interface) is a high-speed I/O
buffer N+16 bits wide and 32 words long. This length allows about
ten average PE instructions to be queued. A head and tail pointer
indicate the position of the next word to be dequeued or enqueued,
respectively. The buffer dequeues a word if nonempty and when all
PEs make the request (inactive PEs are always "requesting"). The
fetch unit may enqueue a word provided the queue is nonfull. 
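
A circular buffer with head and tail pointers captures the queue behavior just described. The following is a minimal sketch, not the hardware design: the class and method names are hypothetical, each entry pairs the N-bit PE enable vector with a 16-bit instruction word, and an explicit count is kept to tell a full queue from an empty one.

```python
class InstructionQueue:
    """Hypothetical model of the PE instruction queue (CU/PE interface)."""

    def __init__(self, n_pes: int, length: int = 32):
        self.n_pes = n_pes
        self.slots = [None] * length
        self.head = 0    # position of the next word to be dequeued
        self.tail = 0    # position of the next word to be enqueued
        self.count = 0

    def enqueue(self, word: int, enable_vector: int) -> bool:
        """Fetch unit side: returns False when the queue is full."""
        if self.count == len(self.slots):
            return False
        self.slots[self.tail] = (enable_vector, word & 0xFFFF)
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return True

    def dequeue(self, requesting):
        """PE side: release a word only if nonempty and every PE requests.

        `requesting` holds one boolean per PE; disabled PEs are expected
        to keep requesting, as in the text.
        """
        if self.count == 0 or not all(requesting):
            return None
        entry = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return entry
```

A short queue makes the full and "all PEs requesting" conditions easy to exercise, for example `InstructionQueue(n_pes=4, length=2)`.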

In order for the instruction queue to be useful, the total time to
fetch an average instruction, decode its tags, and enqueue its
constituent words should not exceed the time needed by the PE to
execute that instruction. Given that 2900-series microprogrammable
bit-slice components have a cycle time of 200 nanoseconds vs. the
MC68000 basic memory cycle time of 500 nanoseconds, there should be
no problem in filling the PE instruction queue to keep the PEs
"satisfied." If the queue is sufficiently large, the execution of
several consecutive CU execution unit, masking, or control
instructions should not empty the queue and "starve" the PEs. 

For a prototype system of size N = 16 or 32 PEs, the MC68000
execution unit could be used to simulate some of the CU operations
in software and monitor the PEs. For example, the masking operations
unit and CU/PE interface could be implemented in software (but at a
cost in system speed). 

The large address space of the MC68000 could be used to access any
part of up to thirty-two 256K-byte PE memories if the hardware is so
arranged. This scheme would be most useful in a prototype: the CU
execution unit could load and unload PE memories, monitor the
behavior of individual PEs, and so on. (A real system would not use
this scheme because of speed and memory contention problems.) 

VI. MC68000 Simulation Results 

The simulation of the MC68000-based system was carried out using the
same techniques as described earlier. However, these simulations
required the writing of new SIMD algorithms in the MC68000
instruction set, a specialized version of an MC68000 assembler, and
new PSST simulation programs. The PSST timing algorithms were
largely unchanged, but a new table of instruction timing
characteristics had to be prepared. 

The PSST simulator consists of two main coroutines: the simulation
of an MC68000 processor and the simulation of the CU microprogrammed
logic. The actions of the fetch unit and masking operations unit are
included in the CU microprogrammed logic simulation. When the CU
execution unit is to be activated, a copy of the "CU data area" is
passed to the MC68000 simulator and processing is initiated. When a
PE is to be activated, a copy of the appropriate "PE data area" is
passed to the MC68000 simulator. The action of the CU/PE interface
(case 5: overlapping of the instructions) is simulated by the timing
routines. 

The PSST simulator for the MC68000 system is largely complete
although it lacks BCD arithmetic operations, trap and exception
processing, interrupts, and MIMD operation (the asynchronous
interaction of the PEs). It also cannot detect interconnection
network "conflicts." Major effort will be required to implement
interrupts and MIMD operation in both the simulation and timing
routines. 

Two versions of the SIMD image smoothing algorithm for a 16-PE
system were simulated. The algorithms are identical to those
described in Section III. Simulations of both algorithms were
performed for a variety of image sizes ranging from 16x16 to 128x128
pixels. The complete image can be superimposed onto an array of 4x4
(=16) PEs such that each PE processes a subimage of 4x4 to 32x32
pixels. 

Table 3. Comparison of smoothing algorithm simulation and timing
characteristics. The "original" algorithm run time results are
normalized to 1.00. All of the simulations are performed with N=16
PEs. The internal cycle time is 250ns. 

[Table 3 body, with ORIGINAL ALGORITHM and IMPROVED ALGORITHM column
groups, is not recoverable from the source.] 

Table 3 compares the simulation and timing results for the two
smoothing algorithms. The run time has been normalized such that the
original algorithm run time=1.00. These results indicate that the
original algorithm is more efficient for small subimages (fewer than
12x12 pixels per PE) than the "improved" algorithm. The improved
algorithm would be used for real-world-size problems. The actual
algorithm execution time can be calculated for a given
algorithm/image size pair by multiplying the number of cycles by the
cycle time. Assuming a standard 8MHz MC68000 processor, the internal
cycle time is two clock cycle times, or 250ns. Thus, a 128x128
(=16K) pixel image can be smoothed by the 16-PE system in about
31ms. Note that this is algorithm execution time. The simulations do
not include data load/unload time between primary and secondary
memory (which will be highly implementation dependent, e.g., see
[13]). 
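
The conversion between cycle counts and run time used above is simple enough to check directly. The sketch below is only a worked version of that arithmetic; the function names are hypothetical, and the cycle count of roughly 124,000 is back-computed here from the quoted 31ms rather than taken from the source.

```python
CLOCK_HZ = 8_000_000            # standard 8MHz MC68000, as assumed in the text
CLOCKS_PER_INTERNAL_CYCLE = 2   # one internal cycle is two clock cycle times


def internal_cycle_ns() -> int:
    """Internal cycle time in nanoseconds (250 for the values above)."""
    return CLOCKS_PER_INTERNAL_CYCLE * 1_000_000_000 // CLOCK_HZ


def run_time_ms(cycles: int) -> float:
    """Execution time in milliseconds for a given number of cycles."""
    return cycles * internal_cycle_ns() / 1_000_000


def cycles_in(ms: float) -> float:
    """Number of internal cycles that fit in a run time given in ms."""
    return ms * 1_000_000 / internal_cycle_ns()


if __name__ == "__main__":
    print(internal_cycle_ns())   # 250
    print(cycles_in(31))         # cycles implied by a 31ms run time
```

Under these assumptions, a run time of about 31ms corresponds to roughly 124,000 internal cycles for the 128x128 image.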

The 128x128 pixel simulation required about 16 minutes of VAX cpu
time. This corresponds to an average execution rate of over 19 SIMD
instructions per second of cpu time. Recall that the simulator
executes a single PE instruction 16 times, once for each PE.
Somewhat less than half of the cpu time may be saved if the "PE
memory dump" following the simulation is inhibited. The writing of
128^2 numbers to disk files (for verification of the smoothed
output) takes a considerable amount of time. 

A "serialized" (single PE) algorithm was constructed from the
original parallel algorithm to determine the "speedup." The serial
algorithm operates on the entire image (rather than a subimage) and
thus does not need to perform masking or inter-PE transfer
operations. When the number of masking and transfer operations per
pixel processed (parallel overhead) is high, the parallel algorithm
will not be very efficient. If no overhead is involved, the parallel
algorithm should execute 16 times faster on a 16-PE machine than on
a 1-PE machine. As shown in Table 4, the parallel algorithm performs
relatively poorly for small subimage sizes, but approaches a perfect
speedup for "real-size" tasks. 


Table 4. Determination of the speedup factor of the original
parallel algorithm. All of the simulations are performed with N=16
PEs. 


TOTAL IMAGE SIZE   SUBIMAGE SIZE      SERIAL TIME /
(PIXELS)           (PIXELS PER PE)    PARALLEL TIME

[illegible]        [illegible]        [illegible]
[illegible]        [illegible]        13.16
64x64              16x16              14.32
128x128            32x32              15.56
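
To relate the legible speedup factors in Table 4 to the ideal value of 16, one can compute the parallel efficiency (speedup divided by the number of PEs). The snippet below is an illustrative calculation only; the name `efficiency` is an assumption, and only the three speedup values that survive in the table are used.

```python
N_PES = 16

# Speedup factors (serial time / parallel time) legible in Table 4.
LEGIBLE_SPEEDUPS = [13.16, 14.32, 15.56]


def efficiency(speedup: float, n_pes: int = N_PES) -> float:
    """Fraction of the ideal n_pes-fold speedup that was achieved."""
    return speedup / n_pes


if __name__ == "__main__":
    for s in LEGIBLE_SPEEDUPS:
        print(f"speedup {s:5.2f} -> efficiency {efficiency(s):.1%}")
```

The largest case, with a speedup of 15.56, corresponds to a little over 97% of the ideal, which is the sense in which the parallel algorithm approaches a perfect speedup.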


It was observed that the MC68000 divide instruction, which is
executed once per pixel processed to scale the result, accounts for
roughly 35% of the total run time. The divide instruction requires
about 75 machine cycles as compared to a typical add instruction
requiring about 4 cycles. If better run times were necessary, the
algorithm could be restructured to smooth a window of eight
nearest-neighbor pixels (as opposed to nine) and scale the data by
shifting the result right by three bits. A typical 3-bit shift
requires 7 cycles, or about 10% of the divide cycle time. However, a
load and add cycle (about 7 cycles) is saved since only eight pixels
are used in the window. Thus a 35% improvement can be gained by
replacing the divide instruction. 
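
The 35% figure follows from the cycle counts quoted above: the shift costs about as much as the load and add it makes unnecessary, so the net saving is essentially the whole divide. The sketch below works this through under one stated assumption, namely that the 35% share applies to the cycles spent per pixel and that all other per-pixel costs stay unchanged; the function names are hypothetical.

```python
def per_pixel_cycles(divide_cycles: float = 75, divide_share: float = 0.35) -> float:
    """Total cycles per pixel implied by the divide's share of run time."""
    return divide_cycles / divide_share


def improvement(divide_cycles: float = 75,
                shift_cycles: float = 7,
                load_add_cycles: float = 7,
                divide_share: float = 0.35) -> float:
    """Fractional run time saved by replacing the divide with a shift."""
    before = per_pixel_cycles(divide_cycles, divide_share)
    # Remove the divide, add the 3-bit shift, drop one load and add
    # because the window shrinks from nine pixels to eight.
    after = before - divide_cycles + shift_cycles - load_add_cycles
    return 1 - after / before


if __name__ == "__main__":
    print(f"{improvement():.0%}")
```

With the default values, the 7-cycle shift and the 7-cycle load and add cancel, leaving a saving equal to the divide's share of the run time.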

The 75 cycles for a divide instruction is the maximum instruction
time; the actual time required is data-dependent and is not
considered by the PSST timing routines. If some PEs finish the
instruction before others, they will be made to wait until all the
PEs have finished. Recall that a PE instruction is dequeued from the
FIFO buffer only when all PEs make the request for the next
instruction. However, if all of the PEs finish the division before
the 75 cycle maximum, the hardware will be able to exploit this and
release the next instruction to the PEs. 

The simulation results presented may be extrapolated to determine
timings and speedups for other machine and/or image sizes. The run
time of an algorithm depends on the relative sizes of the machine
and the image, or equivalently, the subimage size in pixels per PE.
For the smoothing examples, a minimum machine size of 4 PEs is
necessary and sufficient so that all relevant inter-PE transfer and
masking instructions are included. For example, a 4-PE SIMD machine
can smooth an 8x8 pixel image in the same amount of time as a 16-PE
machine can smooth a 16x16 pixel image. In each case, a PE operates
on a subimage of 16 pixels. Similarly, since 16 PEs can smooth a
128x128 (=16K) pixel image in 31ms, a 64-PE system of the same
design and using the same algorithm could smooth a 256x256 (=64K)
pixel image in 31ms. (For larger machines, the number of stages in
the Generalized Cube network will increase; however, assuming that
the propagation delay of the network is overlapped with PE
operations, the impact of the added stages is negligible.) In
general, increasing the number of PEs by a factor of four allows
four times as many pixels to be processed in the same amount of
time. However, this does not mean that processing four times as many
pixels will take four times as long for a fixed machine size. In the
latter case, the fixed and variable costs of performing the
particular algorithm must be taken into account. 
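
The extrapolation argument rests on the subimage size per PE being the same in the cases compared. A small helper makes that check explicit. This is an illustrative sketch only: the function name `subimage_side` is an assumption, and it presumes a square image laid over a square array of PEs, as in the examples in the text.

```python
import math


def subimage_side(image_side: int, n_pes: int) -> int:
    """Side length, in pixels, of the square subimage held by each PE."""
    pes_per_side = math.isqrt(n_pes)
    if pes_per_side * pes_per_side != n_pes:
        raise ValueError("number of PEs must be a perfect square")
    if image_side % pes_per_side != 0:
        raise ValueError("image side must divide evenly among the PEs")
    return image_side // pes_per_side


if __name__ == "__main__":
    # Pairs quoted in the text: equal subimage sizes imply equal run times.
    print(subimage_side(8, 4), subimage_side(16, 16))       # 4 4
    print(subimage_side(128, 16), subimage_side(256, 64))   # 32 32
```

Both pairs give matching subimage sides (4 and 32 pixels respectively), which is why the run times are expected to match when the network delay is overlapped with PE operations.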
VII. Conclusions 

Based on the results of past simulation studies, the design of an
extensible SIMD/MIMD machine based on state-of-the-art
microprocessors and off-the-shelf components was developed. The
interface logic necessary for SIMD/MIMD processing was found to be
minimal. Thus the high cost of designing and fabricating a custom
VLSI PE has been saved. The architecture could be used as a single
SIMD/MIMD machine, or as a building block for a larger multiple-SIMD
or partitionable SIMD/MIMD system using the techniques described in
[13]. Also, the design presented is easily modified even after it is
constructed since the CU does not decode any PE instructions. This
is especially important since the MC68000 processor is not yet in
the final stages of its evolution. The use of an MC68000-based
control unit in a prototype has also been shown to be highly
desirable. In a final design, however, many of the CU functions will
have to be implemented using bit-slice technologies. 

Given these considerations, it appears that a powerful SIMD/MIMD
system having at least 128 processors could be built without
encountering severe physical hardware restrictions (e.g., space,
power, and cooling requirements, bus length restrictions), and at a
reasonable cost using current technology. Further, we have working
SIMD machine simulators and trace-driven timing analysis algorithms
that can be used to evaluate additional SIMD programs for image
processing and pattern recognition in order to study various system
architecture features. 

Abstract - Many proposed large-scale parallel processing 
systems (e.g., PASM) can operate in multiple-SIMD 
mode. The multiple control units in such a system 
share a common secondary storage for programs. The 
control units use paging to transfer programs to their 
primary memories. One design problem is determining 
the optimal service rate for the secondary storage, where 
the "optimal" is characterized by maximum processor 
utilization. The problem is approached by developing a 
queueing network model for the PASM control system 
memory hierarchy. Based on assumed values for param- 
eters which characterize the expected task environment, 
an optimal service rate is derived from the model. The 
values of the parameters in the model can be varied to 
determine the impact these changes would have on sys- 
tem performance. Simulation results verifying various 
aspects of the model are presented. The results are 
shown to apply to the general model for a multiple- 
SIMD machine. 


I. Introduction 


A multiple-SIMD system (e.g., [9]) is a parallel pro- 
cessing system which can be dynamically reconfigured to 
form one or more independent SIMD (single instruction 
stream - multiple data stream) [5] machines of varying 
sizes. Handling the memory management problem for 
the multiple control units is an issue which must be con- 
sidered in the design of multiple-SIMD systems. One 
possible solution to the problem is the use of virtual 
memory [1]. If virtual memory is used in a multiple- 
SIMD system with common secondary storage for the 
multiple control units, it is necessary to determine the 
optimal page request service rate for the secondary 
storage. The optimal is characterized by maximum util- 
ization of the processors. 

PASM is a multimicrocomputer system being 
designed at Purdue University for a variety of image 
processing and pattern recognition problems [10]. It is 
the use of PASM in the multiple-SIMD mode of opera- 
tion which motivates this study. In this paper a queue- 
ing network model is developed for the memory hierar- 
chy of the multiple control units in PASM and is 
analyzed to determine the optimal page request service 
rate for the secondary storage. The optimal service rate 
for the secondary storage can be determined from the 
average system page request rate using heuristics for 
serial multiprogrammed systems as a guideline. The 
average system page request rate can be determined by 
making assumptions about the task environment (e.g., 
number of processing elements which tasks require and 
the estimated execution time of the tasks). The system 
idle time which results from the multiple control units 
waiting for page requests to be serviced is determined 
for the case where the system page request rate deviates 
from the average rate which was used to determine the 
optimal secondary storage service rate. The values of 
the parameters in the model can be varied to determine 
the impact these changes would have on system perfor- 
mance. Simulation results verifying various aspects of 
the model are presented. The model and analysis is 
related to a general model for a multiple-SIMD machine. 

Section II is an overview of the PASM multimicro- 
computer system. Terminology is defined in Section III. 
In Section IV a queueing network model for the PASM 
control system memory hierarchy is developed. The 
average system page request rate for PASM is deter- 
mined in Section V. In Section VI the optimal service 
rate for the secondary storage is determined. Opera- 
tional analysis [3] is used to determine the average idle 
time for the multiple control units in Section VII. 
Simulation results are presented in Section VIII. In Sec- 
tion IX the analysis is related to the general model for a 
multiple-SIMD machine. 


II. PASM Overview


PASM, a partitionable SIMD/MIMD machine, is a 
large-scale dynamically reconfigurable multiprocessor 
system [10]. It is a special purpose system being 
designed to exploit the parallelism of image processing 
and pattern recognition tasks. PASM can be partitioned 
to operate as many independent SIMD and/or MIMD 
(multiple instruction stream - multiple data stream) 
machines of varying sizes. PASMOS is the operating
system for PASM. 

A block diagram of the basic components of PASM 
is given in Figure II.1. The Parallel Computation Unit
contains $N = 2^n$ processors, N memory modules, and an
interconnection network (see Figure II.2). The Parallel 
Computation Unit processors are microprocessors that 
perform the actual SIMD and MIMD computations. 

Figure II.1: Block diagram overview of PASM.

Figure II.2: PASM Parallel Computation Unit.

The Parallel Computation Unit memory modules are
used by the Parallel Computation Unit processors for
data storage in SIMD mode and both data and instruc- 
tion storage in MIMD mode. The interconnection net- 
work provides a means of communication among the 
Parallel Computation Unit processors and memory 
modules. The System Control Unit is a conventional 
machine, such as a PDP-11, and is responsible for the 
overall coordination of the activities of the other com- 
ponents of PASM. 

The Memory Storage System provides secondary 
storage space for the Parallel Computation Unit data 
files in SIMD mode, and for both the Parallel Computa- 
tion Unit data and program files in MIMD mode. The 
Memory Management System controls the transferring of 
files between the Memory Storage System and the 
Parallel Computation Unit memory modules. It 
employs a set of cooperating dedicated microprocessors. 
Multiple storage devices are used in the Memory Storage 
System to allow parallel data transfer.

The Micro Controllers (MCs) are a set of micropro- 
cessors which act as the control units for the Parallel 
Computation Unit processors in SIMD mode and orches- 
trate the activities of the Parallel Computation Unit 
processors in MIMD mode. There are $Q = 2^q$ MCs.
Each MC controls N/Q Parallel Computation Unit pro-
cessors [7]. A virtual SIMD machine (partition) of size
RN/Q, where $R = 2^r$ and $0 \le r \le q$, is obtained by
loading R MC memory modules with the same instruc- 
tions simultaneously. Similarly, a virtual MIMD 
machine of size RN/Q is obtained by combining the 
efforts of the Parallel Computation Unit processors and 
R MCs. Q is therefore the maximum number of parti- 
tions allowable, and N/Q is the size of the smallest par- 
tition. Possible values of N and Q are 1024 and 16, 
respectively. 

Each MC processor is attached to a memory module 
which consists of a pair of memory units. The second 
memory unit may be used to load the initial pages of 
the next task while the current task is executing 
instructions from the first memory unit. In this analysis 
the steady-state condition is considered, i.e., the effect
of preloading is ignored. Since a task which is executing
uses only one memory unit, the paging analysis does not
consider the double-buffering. In SIMD mode, each MC
fetches instructions from its memory module, executing 
the control flow instructions (e.g., branches) and broad- 
casting the data processing instructions to its Parallel 
Computation Unit processors. In MIMD mode the MCs 
may be used to help coordinate the activities of their 
Parallel Computation Unit processors. 

SIMD programs are stored in the Control Storage 
which is the secondary storage for the MCs (see Figure 
II.1). The loading of SIMD programs from the Control
Storage into the MC memory units is controlled by the 
System Control Unit and Control Storage Controller. 
The Control Storage Controller is a dedicated micropro- 
cessor which manages the Control Storage file system. 
When large SIMD tasks are run, i.e., SIMD tasks which 
require more than N/Q processors, more than one MC 
executes the same set of instructions. Therefore each of 
the MC memory units must be loaded with the same set 
of instructions. The fastest way to load several MC 
memory units with the same set of instructions is to 
load all of the memory units at the same time. This 
can be accomplished by connecting the Control Storage 
to all the MC memory units via the MC Memory Sys- 
tem Switch. A block diagram of the control system 
memory hierarchy is given in Figure II.3. The MC 
Memory System Switch is controlled by the Control 
Storage Controller. All interaction between the Control 
Storage and the System Control Unit is done through 
the Control Storage Controller. (In [10] an enhanced
scheme for connecting the MC processors to the MC
memory modules is also considered. The analysis in this
paper also applies directly to that scheme.)

Figure II.3: Overview of PASM control system memory hierarchy for Q=16.

For some applications of PASM it is possible that 
the SIMD programs may be too large to fit into the pri- 
mary memory (memory unit) of a given MC. Virtual 
memory may be used to give the programmer the illu- 
sion that the primary memory is much larger than in 
reality. There are two methods for implementing vir- 
tual memory: paging and segmentation [1]. In this 
paper, paging is considered. To implement paging as a 
part of the PASMOS operating system, the system must 
provide a translation mechanism to map the virtual 
address, which is used by the programmer, to a physical 
address, which is used by the system. In PASM, the 
translation is done by the MCs. When a referenced page
is not in an MC's primary memory (memory unit), the MC
has a page fault. When an MC has a page fault, the MC
sends a request on the request bus (see Figure II.3) to 
the Control Storage Controller which then services the 
request by locating the page in the Control Storage and 
sending it to the appropriate MC memory units through 
the MC Memory System Switch. 

Consider the case where an SIMD task requires more 
than one MC. When a page fault occurs for the task, 
all of the MCs which are executing the task have a page 
fault. Since the faulted page is the same for all of the 
MCs which are executing the task, the page may be 
broadcast to all of the MC memory units simultaneously 
through the MC Memory System Switch. Hence, only 
one of the MCs must report the page fault to the
Control Storage Controller, i.e., only one page request is
generated. The MC which reports the fault can always 
be the same (e.g., logical 0 in the virtual SIMD machine) 
or may vary from one page fault to the next. 

When PASM is operating as a number of indepen- 
dent virtual SIMD machines of varying sizes, the MCs 
are in effect a virtual MIMD machine. The secondary 
storage to this MIMD machine is the Control Storage. 
The model for the control system memory hierarchy is 
developed in Section IV. 


III. Terminology


In this section the terminology which is used in the
analysis is defined. Virtual time is defined to be the
time that a processor is executing a task, not including
the time that it is idle waiting for page faults to be ser-
viced or the time during which no task is assigned to it.
Real time includes all time. The real page fault rate,
referred to as the “page fault rate,” is defined to be the
rate at which page faults occur over real time, i.e., the
number of page faults for the processor divided by the
real time. The virtual page fault rate is defined to be the
rate at which page faults occur over virtual time, i.e., the
number of page faults for the processor divided by the
virtual time. The real page request rate, referred to as
the “page request rate,” and virtual page request rate are
the rates at which pages are requested from the secondary
storage over real time and virtual time, respectively.

For PASM, the virtual page request rate for MC$_i$ is
$\nu_i$. The MC utilization is the fraction of time an MC is
executing. The utilization of MC$_i$ is $U_i$. The (real) page
request rate for MC$_i$ is $\lambda_i = U_i \nu_i$. The (real) system
page request rate is the combined page request rates of
all the MCs and is denoted by $\lambda_{sys}$.
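
The relation $\lambda_i = U_i \nu_i$ follows from the definitions above. As a short worked step, let $F_i$ be the number of page requests made by MC$_i$ over an observation period, $t^{v}_i$ its virtual time, and $t^{r}_i$ its real time over the same period. These three symbols are introduced here only for illustration, and the utilization is taken to be $U_i = t^{v}_i / t^{r}_i$, the fraction of real time spent executing:

$$\lambda_i = \frac{F_i}{t^{r}_i} = \frac{F_i}{t^{v}_i} \cdot \frac{t^{v}_i}{t^{r}_i} = \nu_i U_i .$$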


IV. Model 


In this section a queueing network model is
developed for the PASM control system memory hierar-
chy. The interaction between the MCs and the Control
Storage can be modeled by the two-station cyclic net-
work in Figure IV.1 [3]. The MC subsystem contains
the Q MC processor-memory unit pairs. Since only one
of the MCs in a group of MCs executing a task makes
page requests, there are only T MCs making requests,
where T is the number of tasks executing. Hence, the
network is closed with T customers. The average time
between page requests for each of the MCs making page
requests is $1/\Lambda$, where $\Lambda$ is assumed to be the virtual
page fault rate for all tasks. The page request rate of
the MC subsystem is the system page request rate $\lambda_{sys}$.

Figure IV.1:

The Control Storage subsystem services the page
requests made by the MC subsystem. The service rate
of the Control Storage subsystem is $\mu$. The service
queue at the Control Storage uses a FIFO queueing dis-
cipline. The number of requests in the Control Storage
subsystem at a given time is $n$, where $0 \le n \le T$. The
throughput of the network is $X_0$.
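
As an illustration of how a closed two-station cyclic network of this kind behaves, the following Python sketch computes the Control Storage utilization and throughput for a given number of circulating tasks. It assumes exponentially distributed busy and service times (the classical finite-source, or machine-repairman, queue). That distributional assumption is added here for the example only and is not required by the operational analysis used in this paper; the function and parameter names are hypothetical.

```python
from math import factorial


def control_storage_utilization(tasks, fault_rate, service_rate):
    """Utilization of the single server in a closed two-station cyclic
    network with `tasks` customers (assumption: exponential times).

    fault_rate   -- virtual page fault rate of each task (Lambda)
    service_rate -- Control Storage service rate (mu)
    """
    rho = fault_rate / service_rate
    # Unnormalized probability that n requests are at the Control Storage:
    # T! / (T - n)! * rho**n  (finite-source single-server queue).
    weights = [
        factorial(tasks) // factorial(tasks - n) * rho**n
        for n in range(tasks + 1)
    ]
    p_empty = weights[0] / sum(weights)
    return 1.0 - p_empty


def network_throughput(tasks, fault_rate, service_rate):
    """Throughput X_0 = utilization * service rate."""
    utilization = control_storage_utilization(tasks, fault_rate, service_rate)
    return utilization * service_rate


if __name__ == "__main__":
    for t in (1, 3, 16):
        u = control_storage_utilization(t, fault_rate=1.0, service_rate=5.16)
        print(t, round(u, 3))
```

Under these assumptions the utilization rises with the number of independent tasks and approaches one when many tasks share a Control Storage sized for only a few.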


V. System Page Request Rate 


In this section the queueing network model is
analyzed to determine the average system page request
rate. If all of the MCs are executing a different task (Q
tasks executing), then the virtual page request rate for
each of the MCs is $\Lambda$, i.e., $\nu_i = \Lambda$, where $0 \le i < Q$.
Therefore, the system page request rate is:

$$\lambda_{sys} = \sum_{i=0}^{Q-1} \lambda_i = \sum_{i=0}^{Q-1} U_i \nu_i = \sum_{i=0}^{Q-1} U_i \Lambda = \Lambda \sum_{i=0}^{Q-1} U_i .$$

Using the simplifying assumption that $U_i = U_{mc}$, a con-
stant MC utilization, where $0 \le i < Q$, then
$\lambda_{sys} = Q\, U_{mc}\, \Lambda$.

However, in the case of PASM, there are not usually 
Q independent tasks executing. For example, if an 
SIMD task of size RN/Q is being executed by the Paral- 
lel Computation Unit, the same instruction stream is 
being used by each of the R MCs which are controlling 
the task. The virtual page request rate to the Control
Storage by the R MCs can be reduced from $R\Lambda$ to $\Lambda$ by
having one MC make the page requests and having the
Control Storage broadcast the page to all R MC
memory units simultaneously through the MC Memory 
System Switch (see Figure II.3). 

To determine the actual average system page request
rate it is necessary to determine the average number of
independent instruction streams or tasks being executed
by the MCs. This discussion will be limited to the exe-
cution of SIMD tasks. There are q + 1 different sizes of
SIMD tasks which can be controlled by the Q MCs,
where $q = \log_2 Q$. A task may require $2^i$ MCs, where
$0 \le i \le q$. Let the probability that an SIMD task
requiring $2^i$ MCs is created be $P_i$. Let $E_i$ be the average
execution time for tasks which require $2^i$ MCs. The
average value of the processor-time product for a task
which requires $2^i$ MCs is defined as the product of the
average execution time and the number of MCs
required, $E_i 2^i$. The processor-time product can then be
used to weight the $P_i$ distribution to determine $R_i$, the
probability that a task requiring $2^i$ MCs will be execut-
ing on any MC which has a task assigned to it. $R_i$ is
defined as:


$$R_i = \frac{P_i E_i 2^i}{\sum_{j=0}^{q} P_j E_j 2^j}$$


The average execution time parameters may be varied
based on system use experience. For this analysis it is
assumed that tasks of all MC requirements have equal
execution time, so $E_i = E$. Therefore,

$$R_i = \frac{P_i E\, 2^i}{\sum_{j=0}^{q} P_j E\, 2^j} = \frac{P_i 2^i}{\sum_{j=0}^{q} P_j 2^j}$$
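
The processor-time weighting can be sketched in Python as follows. The function name and the two-size example distribution are hypothetical and chosen only to keep the arithmetic easy to check by hand; exact fractions are used so that the result can be compared term by term with the formula.

```python
from fractions import Fraction


def assigned_task_probabilities(p, e):
    """Return R_i for each task size 2**i.

    p[i] -- probability that a task requiring 2**i MCs is created (P_i)
    e[i] -- average execution time of tasks requiring 2**i MCs (E_i)
    """
    # Processor-time product P_i * E_i * 2**i for every task size.
    weights = [p_i * e_i * 2**i for i, (p_i, e_i) in enumerate(zip(p, e))]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example: two task sizes (1 MC and 2 MCs), equally likely
# to be created, with equal execution times.
r = assigned_task_probabilities(
    [Fraction(1, 2), Fraction(1, 2)],
    [Fraction(1), Fraction(1)],
)
print(r)
```

In this example the two-MC tasks occupy twice as many MCs for the same time, so an assigned MC is twice as likely to be running one of them.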


In the analysis in this paper, a PASM with 16 MCs is
assumed, i.e., q = 4. For this analysis it is also assumed
that the distribution of the number of MCs required by a
task is uniform, i.e., $P_i = 1/5$, where $0 \le i \le 4$. Once
again, this assumption can be varied based on system
use experience. The probability that a task requiring $2^i$
MCs is executing on a given assigned MC is $R_i = 2^i/31$,
where $0 \le i \le 4$. The following theorem uses the above
result for the probability that a task requiring $2^i$ MCs
will be executing on any given assigned MC to deter-
mine the average system page request rate.


Theorem 1: The average system page request rate,
$\bar{\lambda}_{sys}$, is:

$$\bar{\lambda}_{sys} = Q\, U_{mc}\, \Lambda \sum_{i=0}^{q} \frac{R_i}{2^i},$$

where $\Lambda$ is the virtual task page fault rate, $U_{mc}$ is the
MC utilization for all MCs, and $R_i$ is the probability
that a task which requires $2^i$ MCs will be executing on
any given assigned MC.


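Proof: Consider an SIMD task which requires $2^i$ MCs.
The set of MCs which is executing this task will be
denoted by $S_i$. The virtual page fault rate for the task is
$\Lambda$. When a page fault occurs, only one of the MCs in
the set $S_i$ reports the fault to the Control Storage
Controller. If MC$_j$, $j \in S_i$, is reporting the page faults,
$\nu_j = \Lambda$ and $\nu_k = 0$ for all $k \in S_i$ and $k \ne j$. Therefore, from
the set $S_i$ of $2^i$ MCs, only one page request is generated
for each task page fault. Thus,

$$\sum_{k \in S_i} \nu_k = \Lambda .$$

The average virtual page request rate for MC$_j$, where
$j \in S_i$, is defined as:

$$\mathrm{Avg}[\nu_j \mid j \in S_i] = \frac{1}{2^i} \sum_{k \in S_i} \nu_k = \frac{\Lambda}{2^i} .$$

The notation Avg[x] denotes the average value of x.
The average of the virtual MC page request rates, $\bar{\nu}$,
can then be calculated by taking the average value of $\nu_j$
over all possible task sizes. Hence,

$$\bar{\nu} = \mathrm{Avg}[\nu_j] = \sum_{i=0}^{q} R_i\, \mathrm{Avg}[\nu_j \mid j \in S_i] = \sum_{i=0}^{q} R_i \frac{\Lambda}{2^i} .$$

The system page request rate, $\lambda_{sys}$, is defined as:

$$\lambda_{sys} = \sum_{j=0}^{Q-1} \lambda_j = \sum_{j=0}^{Q-1} U_j \nu_j .$$

Assuming that the utilization for all of the MCs is
the same, i.e., $U_j = U_{mc}$, where $0 \le j < Q$, the average
value of the system page request rate, $\bar{\lambda}_{sys}$, is:

$$\bar{\lambda}_{sys} = \mathrm{Avg}[\lambda_{sys}] = \mathrm{Avg}\Big[\sum_{j=0}^{Q-1} U_j \nu_j\Big] = \sum_{j=0}^{Q-1} \mathrm{Avg}[U_j \nu_j] = \sum_{j=0}^{Q-1} U_{mc}\, \bar{\nu} = Q\, U_{mc}\, \bar{\nu} .$$

Substituting in the equation for $\bar{\nu}$,

$$\bar{\lambda}_{sys} = Q\, U_{mc} \sum_{i=0}^{q} R_i \frac{\Lambda}{2^i} = Q\, U_{mc}\, \Lambda \sum_{i=0}^{q} \frac{R_i}{2^i},$$

which is the desired result. □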


It is noted that the system page request rate is
dependent on Q, the number of Micro Controllers, and is
independent of N, the number of Parallel Computation
Unit processors. Theorem 1 is generalized to account
for the fact that the task virtual page fault rate $\Lambda$ may
vary for tasks requiring different numbers of MCs in the
following corollary.


Corollary 1: The average system page request rate,
$\bar{\lambda}_{sys}$, is:

$$\bar{\lambda}_{sys} = Q\, U_{mc} \sum_{i=0}^{q} \frac{\Lambda_i}{2^i} R_i ,$$

where $\Lambda_i$ is the virtual page fault rate for the instruction
stream of a task which requires $2^i$ MCs.

Proof: Follows directly from the proof of Theorem 1. □


When an MC is not executing, it is either waiting for
a page request to be serviced by the Control Storage or
it does not have a task assigned to it to execute. The
virtual utilization is the utilization of the MC while it
has a task assigned to it and is denoted by $U_v$. The
assignment ratio is the fraction of time an MC has a
task assigned to it. If A is the average MC assignment
ratio, then $U_{mc} = A\, U_v$. Note that if the MCs are
always assigned tasks, then $U_{mc} = U_v$. The (real) page
fault rate for a task may now be defined as $U_v \Lambda$ since
the virtual utilization only accounts for the time that a
task is assigned to a group of MCs.

The multitasking level, T, is defined to be the
number of tasks which are executing on the system at a
given time. The average system page request rate may
be defined in terms of $\bar{T}$, the average multitasking level,
and $U_v \Lambda$, the (real) task page fault rate, to be:
$\bar{\lambda}_{sys} = \bar{T}\, U_v \Lambda$. Combining this with the result of
Theorem 1, the average multitasking level is determined
to be:

$$\bar{T} = \frac{\bar{\lambda}_{sys}}{U_v \Lambda} = A\, Q \sum_{i=0}^{q} \frac{R_i}{2^i} .$$

Hence, instead of Q tasks executing, the average multi-
tasking level is $\bar{T}$, which for the uniform distribution
case with all MCs assigned tasks (A = 1) would be
80/31, and the average system page request rate is:

$$\bar{\lambda}_{sys} = \bar{T}\, U_v \Lambda = \frac{80}{31}\, U_v \Lambda = 2.58\, U_v \Lambda ,$$

where $U_v \Lambda$ is the task page fault rate.

In conclusion, in the case where all 16 MCs are exe-
cuting tasks it might be expected that the average sys-
tem page request rate would be $16\, U_v \Lambda$, where $U_v \Lambda$ is the
page fault rate for the task running on each MC. In
this section it has been determined that the average
page request rate for the system is only $2.58\, U_v \Lambda$ when
there is a uniform distribution of task sizes. Hence, the
average system page request rate is only 16.1 per cent of
what might be expected when all 16 MCs are executing
tasks. The worst case system page fault rate is $16\, U_v \Lambda$,
which occurs when each MC is executing an indepen-
dent task. On the other hand, when all MCs are exe-
cuting the same task, the system page fault rate is
$U_v \Lambda$. The average multitasking levels for a variety of
$P_i$ distributions are given in Table V.1.
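
Combining the relation between the average system page request rate and the average multitasking level with Theorem 1 gives $\bar{T} = A\, Q \sum_i R_i / 2^i$. The following Python sketch evaluates that expression for a given task size distribution, assuming equal execution times for all task sizes as in the analysis above. The function and parameter names are hypothetical.

```python
def average_multitasking_level(p, num_mcs=16, assignment_ratio=1.0):
    """Average number of independent instruction streams.

    p[i]             -- probability that a task requiring 2**i MCs is created
    num_mcs          -- Q, the number of Micro Controllers
    assignment_ratio -- A, fraction of time an MC has a task assigned
    Assumes equal execution times for all task sizes (E_i = E).
    """
    denom = sum(p_i * 2**i for i, p_i in enumerate(p))
    r = [p_i * 2**i / denom for i, p_i in enumerate(p)]
    return assignment_ratio * num_mcs * sum(r_i / 2**i for i, r_i in enumerate(r))


uniform = average_multitasking_level([0.2] * 5)
print(round(uniform, 2))       # average multitasking level, 80/31
print(round(uniform / 16, 3))  # fraction of the 16-independent-task rate
```

The two limiting cases give a quick sanity check: if every task requires one MC the level is 16, and if every task requires all 16 MCs the level is 1.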


VI. Optimal Control Storage Service Rate 


Criteria for optimal memory management in mul-
tiprogrammed systems have been given in [2,4,8]. The
optimum is characterized by maximal system service
rate, and in turn by maximal processor utilization and
minimal response time [4]. One such criterion is the
50% criterion. The 50% criterion for optimal memory
management states that in a multiprogrammed system
with page request rate $\lambda$, the use of the CPU is “optim-
ized” when the disk service rate $\mu = 2\lambda$ so that the disk
is 50% utilized [8]. So for $\mu < 2\lambda$, as $\mu$ is increased, the
system service rate is increased significantly, and for
$\mu > 2\lambda$, as $\mu$ is increased, the system service rate does
not increase significantly. Thus, $\mu = 2\lambda$ is considered
“optimal.”

Table V.1: Average multitasking level, $\bar{T}$, for a
           variety of task size distributions. The
           task size is the number of MCs a task
           requires. $P_i$ is the probability that a task
           which requires $2^i$ MCs is created.

P_0    P_1    P_2    P_3    P_4    T
0.20   0.20   0.20   0.20   0.20    2.58
0.00   0.25   0.25   0.25   0.25    2.13
0.00   0.00   0.33   0.33   0.33    1.72
0.00   0.00   0.00   0.50   0.50    1.33
0.00   0.00   0.00   0.00   1.00    1.00
1.00   0.00   0.00   0.00   0.00   16.00
0.50   0.50   0.00   0.00   0.00   10.67
0.33   0.33   0.33   0.00   0.00    6.86
0.25   0.25   0.25   0.25   0.00    3.79
0.23   0.23   0.23   0.23   0.08    3.38
0.50   0.00   0.00   0.00   0.50    1.88
0.00   0.50   0.00   0.00   0.50    1.78
0.00   0.00   0.50   0.00   0.50    1.60
0.00   0.00   0.00   0.50   0.50    1.14
0.00   0.00   0.50   0.50   0.00    2.67
0.00   0.00   0.00   1.00   0.00    2.00
0.00   0.00   1.00   0.00   0.00    4.00
0.00   1.00   0.00   0.00   0.00    8.00
0.00   0.33   0.33   0.33   0.00    3.34
0.10   0.25   0.30   0.25   0.10    2.96

For the class of systems studied in [8], it was deter-
mined that the utilization of the secondary storage
which resulted in optimal memory management was
50%. In order to determine the appropriateness of the
50% criterion for the PASM MC secondary storage, MC
utilization was used as a performance measure. Figure
VI.1 is a graph of the MC utilization as a function of
the Control Storage utilization which was generated
from simulation data (details of simulation are in [11]).
There are three optimal values for the Control Storage
utilization in PASM: 32.5%, 50%, and 62.5%. They are
all considered optimal since the increase in MC utiliza-
tion resulting from a small decrease in Control Storage
utilization is much less than the decrease in the MC utili-
zation resulting from a small increase in the Control
Storage utilization. The selection of the optimal Con-
trol Storage utilization to be used may depend on fac-
tors such as desired speed and cost of available secon-
dary storage devices (e.g., disks).

Figure VI.1:

It is noted that even when the Control Storage utili- 
zation is 0% (all page faults are serviced instantane- 
ously), the MC utilization in Figure VI.1 is not 100%. 
This is due to other factors which impact the MC utili- 
zation besides the Control Storage utilization, such as 
availability of tasks to be scheduled [12] and fragmenta-
tion of MCs (i.e., available MCs do not form an allowable
group). For example, the MC utilization in Figure VI.1
could be increased if the task interarrival rate was
increased, but its shape would remain similar. 

To apply the optimal result to PASM, the Control
Storage service rate $\mu$ must be selected so that
$U_{cs}\, \mu = \bar{\lambda}_{sys} = \bar{T}\, U_v \Lambda$, where $U_{cs}$ is the Control
Storage utilization. Hence, $\mu = (\bar{T}\, U_v \Lambda) / U_{cs}$. If the
optimal Control Storage utilization of 50% is selected,
$\mu = 2 \bar{T}\, U_v \Lambda$. In the case where there is a uniform
distribution of the sizes of tasks created (derived in the
previous sections), $\mu = 5.16\, U_v \Lambda$. If the virtual MC
utilization is assumed to be one, then $\mu = 5.16 \Lambda$. In
actuality, $U_v$ would be less than one, and a value other
than one could be used here. Therefore, based on the
assumption of a uniform distribution of the sizes of
tasks created, the Control Storage service rate should be
set to 5.16 times the virtual task page fault rate.
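
The selection rule above amounts to dividing the average system page request rate by the chosen Control Storage utilization. A minimal Python sketch, with hypothetical function and parameter names and a page fault rate normalized to one, is:

```python
def control_storage_service_rate(t_bar, u_v, fault_rate, u_cs=0.5):
    """Service rate mu = T-bar * U_v * Lambda / U_cs.

    t_bar      -- average multitasking level
    u_v        -- virtual MC utilization
    fault_rate -- virtual task page fault rate (Lambda)
    u_cs       -- target Control Storage utilization (0.5 for the 50% criterion)
    """
    if not 0.0 < u_cs <= 1.0:
        raise ValueError("Control Storage utilization must be in (0, 1]")
    return t_bar * u_v * fault_rate / u_cs


# Uniform task size distribution (T-bar = 80/31), U_v = 1, Lambda = 1.
mu = control_storage_service_rate(t_bar=80 / 31, u_v=1.0, fault_rate=1.0)
print(round(mu, 2))
```

Choosing one of the other candidate utilizations (32.5% or 62.5%) only changes the `u_cs` argument.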


VII. Micro Controller Idle Time


Since the MCs in PASM are not multiprogrammed, 
there is not another task for an MC to execute while it 
is waiting for a page request from its current task to be 
serviced by the Control Storage. In this section, opera- 
tional analysis is used to determine the MC idle time 
which results from an MC waiting for a page request to 
be serviced by the Control Storage. Note that this does 
not include the time that the MC is idle while it does 
not have a task assigned to it. This derivation makes 
use of Little’s Law and is similar to that of the 
“Interactive Response Time Formula” for a terminal 
system in [3]. Let $\bar{n}$ be the mean queue length for a
device (including the request which is being serviced);
let $X_0$ be the throughput of the device; and let R be the
accumulated time at that device per request (time spent
by the request in the queue of the device while waiting
for service plus the service time of the device). Then
Little’s Law [3] is: $\bar{n} = X_0 R$.

Let I denote the average MC idle time and Z denote
the average time interval between when an MC resumes 
execution after a page request is serviced and when its 
next page fault occurs. Hence, Z is the average execu- 
tion time or busy time between page faults for a given 
MC. Each task is executed by a group of one or more 
MCs. Since each MC can have at most one instruction 
stream associated with it, the system has a finite custo- 
mer population [6] (i.e., there is a finite number of page 
requests waiting to be serviced by the Control Storage 
at any given time since each MC cannot have another 
page fault while it is waiting for its current request to 
be serviced). 

A task repeats ‘‘busy-idle cycles” while it has a 
group of MCs assigned to it. A busy-idle cycle has two 
phases: the busy phase, when the group of MCs which 
is assigned to the task is executing, and the idle-phase, 
when the group of MCs is waiting for a page fault to be 
serviced. The mean time for a task to complete one 
busy-idle cycle on a group of MCs is I+ Z. Note that 
Z is the average time spent in the MC subsystem and I
is the average time spent in the Control Storage subsys-
tem during each busy-idle cycle (see Figure IV.1). Since
all tasks which are assigned to MCs are repeating busy-
idle cycles, $\bar{n}$ is equal to $\bar{T}$, the average multitasking
level. Let $X_0$ be the throughput of the Control Storage.
Applying Little’s Law, $\bar{T} = X_0 (I + Z)$.

The throughput, $X_0$, is the product of the utilization,
$U_{cs}$, and the service rate, $\mu$. So the throughput of the
Control Storage is $U_{cs}\, \mu$. The average execution time
between page faults, Z, is $1/\nu$, where $\nu$ is the MC virtual
page fault rate. Therefore, the average MC idle time is:

$$I = \frac{\bar{T}}{X_0} - \frac{1}{\nu} = \frac{\bar{T}}{U_{cs}\, \mu} - \frac{1}{\nu} .$$

This maps to the “Interactive Response Time Formula” 
in [3] by letting the MC busy time correspond to the 
user think time, the MC idle time correspond to the 
user wait time, the number of tasks executing (i.e., mul-
titasking level) correspond to the number of terminals,
and the Control Storage throughput correspond to the
throughput of the central server. 

Suppose the results for the optimal service rate $\mu$
from the previous section are used. Then the Control
Storage service rate $\mu = 5.16 \Lambda$ and

$$I = \frac{T}{U_{cs}\, 5.16 \Lambda} - \frac{1}{\Lambda} = \frac{T - U_{cs}\, 5.16}{U_{cs}\, 5.16 \Lambda} .$$

Next the worst case situation is considered. In the
worst case the Control Storage is completely utilized, so
$U_{cs} = 1$ and

$$I = \frac{T - 5.16}{5.16 \Lambda} .$$

If the Control Storage is completely utilized, the system
page fault rate $\lambda_{sys}$ must be greater than or equal to the
Control Storage service rate $\mu$, so $\lambda_{sys} \ge \mu$. Hence,
based on the 50% criterion and the assumptions used to
select the Control Storage service rate $\mu$, $\lambda_{sys} \ge 5.16 \Lambda$
and $T \ge 5.16$. If during some time interval there are 16
tasks executing (i.e., a multitasking level of 16), the MC
idle time during that time interval would be:

$$I = \frac{16 - 5.16}{5.16 \Lambda} = \frac{10.84}{5.16 \Lambda} = 2.1\, \frac{1}{\Lambda} .$$

The time which an MC is busy executing a task is the
time between page faults, $1/\Lambda$. The time which an MC
is waiting for a page request to be serviced is its idle
time, I. During a time interval when there are 16 tasks
executing, the fraction of time which a given MC would
be idle is:

$$\frac{I}{I + (1/\Lambda)} = \frac{2.1\, (1/\Lambda)}{3.1\, (1/\Lambda)} = 0.677 .$$


So, if during some interval of time the average multi-
tasking level was 16 (i.e., worst case level), the MCs
would be idle 67.7% of that time interval. Based on the
simplifying assumptions in Section V used to compute
the optimal Control Storage service rate $\mu$, the probabil-
ity of the worst case occurring is less than 0.1%. Note
that if the probability had been greater, it would have
impacted the calculation for the average system page
fault rate $\bar{\lambda}_{sys}$, which would have resulted in a faster
Control Storage service rate $\mu$.
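
The idle time expression obtained from Little's Law, $I = T / (U_{cs}\, \mu) - 1/\Lambda$, and the resulting idle fraction can be checked numerically. The Python sketch below uses hypothetical function names and an arbitrary page fault rate, since the idle fraction does not depend on the value chosen for $\Lambda$ when $\mu$ is a fixed multiple of it.

```python
def mc_idle_time(tasks, u_cs, service_rate, fault_rate):
    """Idle time per busy-idle cycle: I = T / (U_cs * mu) - 1 / Lambda."""
    return tasks / (u_cs * service_rate) - 1.0 / fault_rate


def idle_fraction(idle, fault_rate):
    """Fraction of a busy-idle cycle spent waiting: I / (I + 1 / Lambda)."""
    busy = 1.0 / fault_rate
    return idle / (idle + busy)


fault_rate = 2.0                 # hypothetical value of Lambda
service_rate = 5.16 * fault_rate  # mu chosen by the 50% criterion
idle = mc_idle_time(tasks=16, u_cs=1.0, service_rate=service_rate,
                    fault_rate=fault_rate)
frac = idle_fraction(idle, fault_rate)
print(round(idle * fault_rate, 2), round(frac, 3))
```

For the worst case considered above (16 tasks, Control Storage fully utilized), the idle fraction reduces to 1 - 5.16/16.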


VIII. Simulation Results


In this section results from the PASMOS simulator
[11] are given. The simulator has been run with a
variety of average execution times and distributions for
the number of MCs a task requires. The results of all
runs agree with the analytical result of Section V. As
an example, consider the following simulation run where
a random number generator was used to produce a uni-
form distribution (i.e., $P_i = 0.2$) for the number of MCs
a task requires and the expected execution time for the
tasks was fifteen seconds. The simulation ran for
twenty thousand simulation seconds and over two
thousand tasks were executed. The resulting distribu-
tion for the number of MCs required by a task for the
simulation was: $P_0 = 0.191$, $P_1 = 0.204$, $P_2 = 0.198$,
$P_3 = 0.208$, and $P_4 = 0.199$. The resulting average exe-
cution time for a task which requires $2^i$ MCs for the
simulation was: $E_0 = 15.006$, $E_1 = 15.518$, $E_2 = 14.215$,
$E_3 = 14.715$, and $E_4 = 15.869$. The average MC assign-
ment ratio A was 0.606 and the average multitasking
level $\bar{T}$ was 1.531 streams. (Further details of the simu-
lator are beyond the scope of this paper and are given in
[11]. A description of the task scheduling algorithm
which was used by the simulator is given in [12].)

Using the equation from Section V for $R_i$, the proba-
bility that a task which requires $2^i$ MCs is executing on
a given assigned MC, with the $E_i$s and $P_i$s from the
simulation, the results are found to be: $R_0 = 0.030$,
$R_1 = 0.066$, $R_2 = 0.118$, $R_3 = 0.256$, and $R_4 = 0.529$.
Substituting the MC assignment ratio and the $R_i$s into the
equation for the average multitasking level:

$$\bar{T} = A\, Q \sum_{i=0}^{4} \frac{R_i}{2^i} = (0.606)(16) \sum_{i=0}^{4} \frac{R_i}{2^i} = 1.530 ,$$

the average multitasking level is found to be 1.530.


Hence, by using the analytical method of Section V 
with the system characteristics from the simulation it 
has been determined that the average number of 
independent instruction streams is 1.530. The simula- 
tion results give the average number of instruction 
streams to be 1.531. Therefore, the simulation results 
support the analysis in Section V. 
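
The comparison above can be reproduced from the reported simulation statistics. The Python sketch below recomputes the $R_i$ values and the average multitasking level from the measured $P_i$, $E_i$, and assignment ratio; the function and variable names are hypothetical, and only the numbers quoted in this section are used.

```python
def assigned_task_probabilities(p, e):
    """R_i = P_i * E_i * 2**i / sum_j(P_j * E_j * 2**j)."""
    weights = [p_i * e_i * 2**i for i, (p_i, e_i) in enumerate(zip(p, e))]
    total = sum(weights)
    return [w / total for w in weights]


def average_multitasking_level(r, assignment_ratio, num_mcs):
    """T-bar = A * Q * sum_i(R_i / 2**i)."""
    return assignment_ratio * num_mcs * sum(
        r_i / 2**i for i, r_i in enumerate(r)
    )


# Statistics reported for the example simulation run.
P = [0.191, 0.204, 0.198, 0.208, 0.199]
E = [15.006, 15.518, 14.215, 14.715, 15.869]

R = assigned_task_probabilities(P, E)
T_BAR = average_multitasking_level(R, assignment_ratio=0.606, num_mcs=16)
print([round(x, 3) for x in R])
print(round(T_BAR, 3))
```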

The simulator may also be used to confirm the worst 
case MC idle time result from Section VII. The Control 
Storage service rate and task page fault rate were
selected so that $\mu = 5.16 \Lambda$. To create the worst case
situation, the distribution of the number of MCs
required by a given task was adjusted so that all tasks
would require one MC. The average execution time was
adjusted so that the assignment ratio for all MCs would
be one and the average number of tasks executing, $\bar{T}$,
would approach sixteen. The resulting average multi-
tasking level was 16.0; the Control Storage utilization,
$U_{cs}$, was 1.0; and the average MC was idle for 67.5% of
the simulation time. This result agrees with the 
expected result from the analysis in Section VII, where 
in the worst case (i.e., the multitasking level is 16) the 
average MC was idle 67.7% of the time. Again it is 
noted that the probability that this worst case would 
occur is very small. 


IX. Relation to the General Multiple-SIMD Model 


A general model of a multiple-SIMD system is shown
in Figure IX.1. There is a pool of control units with a
common secondary storage, a pool of processing ele-
ments, a switch which is used to connect a control unit
to a group of processing elements, and an interconnec-
tion network for communication among the processing
elements. In the case of PASM, the switch is fixed in
that each processing element is connected to exactly one
control unit (MC), and large machines are created by
combining control units (MCs). Other MSIMD systems,
such as MAP [9], use a crossbar type of switch to assign
the processing elements to the control units. Not all of
the control units are always used. When PASM is exe-
cuting an SIMD task only one of the MCs which is exe-
cuting the task is used to make requests for pages.

Figure IX.1: A general model of a multiple-SIMD machine.

The optimal service rate analysis may be applied to 
the general case by letting the used control units 
correspond to the MCs which are making the requests 
and the unused control units correspond to the MCs 
which are executing tasks but not making page requests. 
Thus the analysis applies to the general case where the 
SIMD machines have a power of two processing ele- 
ments. The power of two constraint may also be eased 
to allow any size SIMD machine.


X. Conclusion 


In this paper a queueing network has been analyzed 
to determine the ‘‘optimal” page request service rate for 
the Control Storage of the PASM multimicrocomputer 
system. It has been shown that the optimal service rate 
for the PASM Control Storage is much lower than 
might be expected. Two possible methods for varying 
the Control Storage service rate include varying the 
number or type of disks, or changing the method of 
storing pages on the disks. Simplifying assumptions 
were made about average execution time, task page 
fault rate, the distribution of the number of MCs which 
a task requires, etc. Based on experience any or all of 
these assumptions can be changed to reflect actual or 
expected system characteristics. Operational analysis 
has been used to determine the MC idle time which 
results from MCs having to wait for page requests to be 
serviced. Simulation results have been given which sup- 
port the analytical result for the average number of 
independent instruction streams and the worst case MC 
idle time. 

This study can also be applied to the use of PASM 
in the MIMD mode of operation or in a combination of
the MIMD and SIMD modes. When PASM is operating
as a number of virtual MIMD machines of varying sizes, 
the MCs may be used to help coordinate the activities 
of the Parallel Computation Unit processors. The coor- 
dination activity of MIMD mode requires the MCs to 
execute significantly fewer instructions than in the con- 
trol activity of SIMD mode. Hence, the page request 
rate is significantly lower for MIMD mode than for 
SIMD mode. Since it is expected that in MIMD mode 
each MC will have its own instruction stream, it can be 
treated as one SIMD partition, and incorporated into 
the run-time statistics. 

In summary, a model has been developed for the 
PASM control system memory hierarchy. Using any 
combination of system feature assumptions and actual 
system characteristics (from experience), the model can 
be used to determine the “optimal” service rate for the 
Control Storage. Furthermore, using the model, values 
for the parameters which characterize the expected task 
environment and secondary storage service rate can be 
varied to determine the impact on MC utilization. The 
model can be adapted for use in any multiple-SIMD 
machine with common secondary storage for the multi- 
ple control units. 